Final recipe locked:
- Qwen3-MoE, 3 experts, top-1 routing
- vocab 262144 (Gemma 3 SP, per-digit input wrap)
- GQA 3:1
- Muon for hidden 2D weights, AdamW for embed and router
- WSD schedule with sqrt cooldown; beta2 ramp from 0.95 to 0.97
- z-loss 1e-4, with gradients this time (the last build had a no_grad bug that silently killed it)
- Qwen3 aux loss, coefficient 0.001
- expert-load monitor that warns on starvation

Three phases: 8K pretrain, then 32K continued pretrain, then 8K SFT.
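That z-loss bug is easy to reintroduce, so here is a minimal sketch of how the penalty should attach to the graph. The function name, shapes, and coefficient default are mine, not the actual training code; the point is only that the squared log-partition term must backprop.

```python
import torch
import torch.nn.functional as F

def loss_with_z(logits, labels, z_coef=1e-4):
    # Standard next-token cross-entropy over (tokens, vocab).
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
    # z-loss: penalise the squared log-partition so logit magnitudes stay bounded.
    # Crucially, NO torch.no_grad() around this -- wrapped in no_grad it becomes
    # a metric instead of a regulariser, which is exactly the silent failure mode.
    z = torch.logsumexp(logits, dim=-1)
    return ce + z_coef * (z ** 2).mean()
```

Since z² is non-negative, the combined loss is always at least the plain cross-entropy, which makes the bug detectable with a one-line check in training logs.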
Day 3 - 05/02/2026
Scamp ships, hits the wall. New plan...
Scamp came back from training today... Didn't go so well, I'm still unsure...
Fast benchmark, temperature 0.7, top_p 0.9:
- "Capital of France is" produced "covered by the Crown" (grammatical, factually wrong)
- "23 + 19 = ?" produced "23. Answer: 23. Answer: 23..." (loops, math broken)
- "def fibonacci(n):" produced a list of letters
It speaks English. It can't reason. At 8K vocab and 50M params, it was never going to.
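For reference, the sampling setup behind that benchmark is plain nucleus sampling. A self-contained sketch of temperature 0.7 / top_p 0.9 decoding for one step (the function is mine, not the benchmark script):

```python
import torch

def sample_next(logits, temperature=0.7, top_p=0.9):
    # logits: (batch, vocab). Temperature-scale, nucleus-filter, then sample.
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_p, sorted_idx = torch.sort(probs, descending=True)
    cum = torch.cumsum(sorted_p, dim=-1)
    # Keep the smallest prefix of tokens whose cumulative mass reaches top_p;
    # the top token is always kept.
    keep = cum - sorted_p < top_p
    keep[..., 0] = True
    filtered = torch.where(keep, sorted_p, torch.zeros_like(sorted_p))
    filtered = filtered / filtered.sum(dim=-1, keepdim=True)
    choice = torch.multinomial(filtered, 1)
    return sorted_idx.gather(-1, choice).squeeze(-1)
```

At these settings a sharply peaked distribution collapses to greedy decoding, so a model that loops on "Answer: 23" is looping on its own top token, not on sampler noise.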
Next build: 412M MoE-3E. Three experts (math, language, code), top-1 routing, random init; let specialization emerge from gradient signal alone. I tried seeded Branch-Train-MiX first, then dropped it: it adds compute for no clear win when the router will find its own attractors anyway.
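The routing half of that plan fits in a few lines. A sketch of a top-1 router with a Switch-style load-balancing auxiliary loss (my reading of the Qwen3-style aux term; class and variable names are mine):

```python
import torch
import torch.nn.functional as F

class Top1Router(torch.nn.Module):
    def __init__(self, d_model, n_experts=3):
        super().__init__()
        # Randomly initialised gate -- no seeding toward math/language/code.
        self.gate = torch.nn.Linear(d_model, n_experts, bias=False)
        self.n_experts = n_experts

    def forward(self, x):
        # x: (tokens, d_model) -> per-token expert choice.
        probs = F.softmax(self.gate(x), dim=-1)
        top_p, top_idx = probs.max(dim=-1)  # top-1 routing
        # Load-balancing aux loss: f_i = fraction of tokens routed to expert i,
        # P_i = mean router probability for expert i. Minimised (value 1.0)
        # when load is perfectly balanced; scaled by a small coefficient upstream.
        f = F.one_hot(top_idx, self.n_experts).float().mean(dim=0)
        P = probs.mean(dim=0)
        aux = self.n_experts * (f * P).sum()
        return top_idx, top_p, aux
```

Watching `f` per step is also exactly what the expert-load starvation monitor needs: an expert whose fraction pins near zero is starving.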
Big lesson today came from limit testing on an A100 80GB. Surprise: every planned phase ran out of memory, even on 80GB. Root cause: at vocab 262144 (the Gemma 3 standard), the output logits dominate memory during forward and backward. Fix: Liger Kernel's fused cross-entropy, which streams the loss computation instead of materialising the full B × T × vocab logits tensor. Without it the build would not run.
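The back-of-envelope numbers make the OOM obvious. Batch and sequence length here are my assumptions for illustration; the vocab size is from the recipe:

```python
# Why full logits OOM at vocab 262144: one fp32 logits tensor, forward only.
B, T, V = 8, 8192, 262144          # assumed batch/seq; vocab from the recipe
logits_bytes = B * T * V * 4       # 4 bytes per fp32 element
print(logits_bytes / 2**30)        # -> 64.0 GiB, before gradients double it
```

A single forward-pass logits tensor already fills most of an 80GB card, and the backward pass needs a gradient tensor of the same shape, which is why streaming the cross-entropy instead of materialising logits is the only way this configuration fits.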
Scamp proved the pipeline runs end-to-end on real hardware. The 412M run starts tomorrow. If routing balances naturally and math finally crystallises, it ships as Crowfeather-412M-3E with GGUF in F16, Q8, Q5, and Q4.
So... the training may have produced a poet if I had done it better. But I didn't, so instead... we get a malformed robot named Scamp... This is progress.
-Shane
P.S. Join the Discord for discussion: https://discord.gg/8ZscHNmJYE
I post my finished stuff here: https://huggingface.co/CompactAI-O