Shard-40m-v1

A 54.5M parameter dense transformer trained on consumer-grade compute (Thunder Compute pretrain + Colab anneal). Released as a research artifact and pipeline-validation reference. Not a deployable model.

This is the first checkpoint in the Shard series of small experimental transformers.

Architecture

Total params:        54,538,752 (~54.5M)
Hidden dim:          512
Layers:              12
Attention heads:     8 (MHA, no GQA)
Head dim:            64
MLP intermediate:    2048 (SwiGLU)
Vocab size:          8192
Max sequence:        8192
Attention pattern:   Gemma-style alternating sliding window (window=1024) + global, last layer global
Norm:                RMSNorm, pre-norm
Position encoding:   RoPE on Q and K
Embeddings:          tied input/output
Activation:          SwiGLU
MoE:                 none
Engram:              none
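The parameter count can be reproduced from the table above, assuming no bias terms, tied embeddings counted once, and two RMSNorm scale vectors per layer plus one final norm (assumptions about the implementation, but ones that reproduce the reported number exactly):

```python
# Back-of-envelope parameter count for the config above.
d, layers, ffn, vocab = 512, 12, 2048, 8192

embed = vocab * d                       # tied with lm_head, counted once
attn  = 4 * d * d                       # Wq, Wk, Wv, Wo (MHA, no biases)
mlp   = 3 * d * ffn                     # SwiGLU: gate, up, down projections
norms = 2 * d                           # two RMSNorm scales per layer
per_layer = attn + mlp + norms

total = embed + layers * per_layer + d  # + final RMSNorm scale
print(total)  # 54538752 -- matches the reported total exactly
```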

Training

Phase 1 (pretrain):
  Compute:           Thunder Compute single GPU
  Steps:             48,220 of a 100,000 step target (paused early)
  Throughput:        86,800 tokens per second
  Optimizer:         Muon for hidden 2D weights, AdamW for embeddings and norms
  LR schedule:       WSD (warmup-stable-decay)
  Stabilizers:       lm_head logit cap 30, z-loss coefficient 1e-4
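The card doesn't spell out the stabilizer forms. A plausible reading, borrowing the tanh soft cap from Gemma 2 and the squared-logsumexp z-loss from PaLM, looks like this sketch:

```python
import torch

def soft_cap(logits: torch.Tensor, cap: float = 30.0) -> torch.Tensor:
    # Squash lm_head logits into (-cap, cap) to prevent runaway logit growth
    # late in training; near-identity for small logits.
    return cap * torch.tanh(logits / cap)

def z_loss(logits: torch.Tensor, coef: float = 1e-4) -> torch.Tensor:
    # Penalize the squared log-partition function so logsumexp stays near 0
    # and the softmax remains well-conditioned.
    return coef * torch.logsumexp(logits, dim=-1).pow(2).mean()

logits = torch.randn(2, 16, 8192) * 50      # deliberately oversized logits
assert soft_cap(logits).abs().max() < 30.0  # bounded after capping
```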

Phase 2 (anneal):
  Compute:           Colab A100
  Steps:             20,000 (full anneal complete)
  Final cross-entropy: 3.27
  Mix:               OpenWebMath, FineWeb-Edu carryover, NuminaMath, MetaMathQA, ArXiv, Cosmopedia, AI2 ARC
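For intuition, the final cross-entropy maps directly to perplexity over the 8192-token vocabulary:

```python
import math

# A cross-entropy of 3.27 nats per token corresponds to a perplexity of
# exp(3.27) ~= 26.3, i.e. the model is on average as uncertain as a uniform
# choice over ~26 tokens at each step.
ppl = math.exp(3.27)
print(round(ppl, 1))  # 26.3
```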

Files

  • models/model.pt — anneal final checkpoint (model state only, 105 MB bf16)
  • models/pretrain.pt — pretrain step 47,500 (with optimizer state, 217 MB)
  • models/tokenizer.json — custom 8192-vocab BPE
  • code/ — minimal loading code (model.py, config.py, tokenizer.py, muon.py)

How to load

import sys, torch
sys.path.insert(0, 'code')
from config import Config
from model import ToyLM
from tokenizer import load_tokenizer

# weights_only=False: the checkpoint may contain a pickled Config object
ck = torch.load('models/model.pt', map_location='cpu', weights_only=False)
cfg = Config(**ck['cfg']) if isinstance(ck['cfg'], dict) else ck['cfg']
model = ToyLM(cfg).cuda().to(torch.bfloat16)
model.load_state_dict(ck['model'])
model.eval()

tok = load_tokenizer('models/tokenizer.json')
ids = torch.tensor([tok.encode('The capital of France is').ids], device='cuda')
with torch.no_grad():
    for _ in range(40):                               # generate 40 new tokens
        logits, _ = model(ids)
        nxt = logits[:, -1].argmax(-1, keepdim=True)  # greedy: most likely token
        ids = torch.cat([ids, nxt], 1)
print(tok.decode(ids[0].tolist()))
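The samples shown under Benchmark were drawn at temperature 0.7 and top_p 0.9. A minimal nucleus-sampling replacement for the argmax line above (a sketch, not the card's own sampling code):

```python
import torch

def sample_top_p(logits: torch.Tensor, temperature: float = 0.7,
                 top_p: float = 0.9) -> torch.Tensor:
    # Temperature-scale, then sample from the smallest set of tokens
    # whose cumulative probability exceeds top_p (nucleus sampling).
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)
    cum = sorted_probs.cumsum(dim=-1)
    # Drop tokens whose *preceding* cumulative mass already exceeds top_p,
    # so the first token crossing the threshold is still kept.
    sorted_probs[cum - sorted_probs > top_p] = 0.0
    sorted_probs /= sorted_probs.sum(dim=-1, keepdim=True)
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx.gather(-1, choice)
```

Swap it into the loop in place of the argmax line: nxt = sample_top_p(logits[:, -1]).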

Benchmark

Greedy decode runs at 47 tokens per second on a single CUDA GPU. The model's footprint is 109 MB in bf16, with 16 MB peak inference memory.

Sampled outputs at temperature 0.7, top_p 0.9:

  Prompt: The capital of France is
  Output: "covered by the Crown" (for example, the Great Seal of France...)

  Prompt: To compute 12 plus 7,
  Output: we can now use the first 6 as a reversible input...

  Prompt: Question: What is 23 + 19? Answer:
  Output: The answer is 23. Answer: 23. Answer: 23 (loops)

  Prompt: def fibonacci(n):
  Output: // Appendix A. - S. B. V. Shanker. - S. M. P. Gerber...

  Prompt: Once upon a time, in a small village,
  Output: a woman is a gentleman in a village with an infinite wealth...

  Prompt: Solve: 17 * 23 = ?
  Output: ?????\n***** (breakdown)

What this artifact proves

The training pipeline runs end to end on consumer-grade hardware: the Muon + AdamW dual-optimizer setup, the WSD schedule, Gemma-style alternating attention, and an anneal phase mixing math, code, and prose all remained stable. Loss decreased monotonically through pretraining, with no NaN events, no divergence, and no rank loss flagged by the Muon min-singular-value sentinel.

What this artifact cannot do

Math is broken: the model hallucinates digits or loops. Code generation is gibberish. Factual grounding is absent: it hallucinates with grammatical confidence. Long-context retrieval is weak: the maximum sequence is 8192, but with a 1024-token sliding window the effective context of the non-global layers is much shorter.

Why release it

To document a reproducible recipe at this scale. The next iteration in this line moves to a 412M-parameter MoE with 3 routed experts, a 262,144-token vocabulary, distillation pretraining from frontier teachers, and a token budget that crosses the Chinchilla line. This artifact is the baseline against which that next model will be measured.

License

Apache 2.0. Use freely. Attribution appreciated but not required.

Citation

@misc{shard40mv1,
  author = {Shane (Crownelius)},
  title  = {Shard-40m-v1: a 54.5M dense transformer trained on consumer compute},
  year   = {2026},
  publisher = {HuggingFace},
  url    = {https://huggingface.co/CompactAI-O/Shard-40m-v1}
}