Tilelli

Working with this repo through an AI agent (Cursor / Claude Code / Codex / Aider / ChatGPT)? Read AGENTS.md first. It has the install path, the verified claims, the verified negative claims (so the agent doesn't repeat them as facts), and the common mistakes other agents have already made on this kit.

A small (~10 M-parameter) byte-level language model with a 3-pathway routed block. Trains and chats out of the box, in either FP32 or ternary mode, on CPU. Part of a family of ternary-first language models (Mosaic, atome-lm, spectrum) that shares the same intent: small, local, ternary-capable, auditable end-to-end.

This kit ships:

  • The architecture in 8 source files (3-pathway + parent multi-pathway)
  • Two trained checkpoints: FP32 chat (deployed) and plain ternary pretrain
  • A working trainer that takes a text corpus and a --model flag
  • A ~700 KB demo training dataset (TinyStories slice) so train.py runs end-to-end on CPU in a few minutes
  • Four verification scripts that exit non-zero if our documented numbers don't reproduce against the bundled v4 ckpt

What's in checkpoints/

| File | Size | Precision | Architecture | Training | Use it for |
|---|---|---|---|---|---|
| tilelli_chat_v4.pt | 39 MB | FP32 | 3-pathway Lite (d=256, L=8) | 12K-step FineWeb-Edu pretrain → chat SFT → abstain-aware SFT | Chat. Deployed at chat.tilelli.tech. SHA 9f1dcc9465003a… |
| tilelli_pretrain_v1_ternary.pt | 39 MB | Ternary {−1, 0, +1}, STE throughout | Parent multi-pathway (d=512, L=7) | 50K-step TinyStories pretrain | Story continuation. Base for your own ternary SFT. SHA e1b0a263b5c2… |

Both are 10M-parameter byte-level models. They use different architectural variants of the same family; see "A note on the two checkpoints" below.
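The table lists truncated SHA prefixes for both checkpoints. As a minimal sketch, a downloaded file could be checked against such a prefix as shown here. It assumes the digests are SHA-256 (the README does not name the algorithm), and the helper names `file_digest` and `digest_matches` are hypothetical, not part of the kit.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, read in chunks to keep memory flat."""
    h = hashlib.new(algorithm)
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def digest_matches(path, expected_prefix, algorithm="sha256"):
    """True if the file's hex digest starts with the (possibly truncated) prefix."""
    return file_digest(path, algorithm).startswith(expected_prefix.lower())


if __name__ == "__main__":
    # Prefixes copied from the table; they are truncated there, so this only
    # compares the leading characters. SHA-256 is an assumption.
    expected = {
        "checkpoints/tilelli_chat_v4.pt": "9f1dcc9465003a",
        "checkpoints/tilelli_pretrain_v1_ternary.pt": "e1b0a263b5c2",
    }
    for ckpt, prefix in expected.items():
        if Path(ckpt).exists():
            print(ckpt, "OK" if digest_matches(ckpt, prefix) else "MISMATCH")
        else:
            print(ckpt, "not found")
```

If the prefixes turn out to come from a different algorithm, pass its name (for example `"sha1"`) as the `algorithm` argument.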


Install (CPU, ~120 MB total)

git clone https://github.com/TilelliLab/Tilelli-llm
cd Tilelli-llm
# CPU-only torch (avoids 2 GB CUDA wheel on Linux):
pip install --index-url https://download.pytorch.org/whl/cpu torch
pip install -e .

See INSTALL.md for macOS / Windows / GPU notes.

Chat

# Talk to the deployed FP32 chat model:
python chat.py "What is the moon?"
# → "i can't answer that. facts like that are beyond a 10m model"

# Or use the generic inference script with either ckpt:
python infer.py --prompt "Hello, who are you?"
# → uses checkpoints/tilelli_chat_v4.pt by default

# Story continuation with the ternary pretrain:
python infer.py --ckpt checkpoints/tilelli_pretrain_v1_ternary.pt \
                --prompt "Once upon a time, there was a little"
# → "girl named Lily. She loved to play outside in the snow. One day…"

Train your own: FP32 or ternary

The kit ships a small TinyStories slice at data/tinystories_demo/ so you can do a smoke training run immediately:

# FP32, 50 steps on CPU, takes a couple of minutes:
python scripts/train.py --model tilelli-lite-fp32 \
    --data-dir data/tinystories_demo --steps 50 \
    --batch-size 4 --seq-len 64 --device cpu

# Same architecture, ternary forward pass (straight-through estimator):
python scripts/train.py --model tilelli-lite-ternary \
    --data-dir data/tinystories_demo --steps 50 \
    --batch-size 4 --seq-len 64 --device cpu

# Vanilla GPT baseline for A/B comparison:
python scripts/train.py --model vanilla-fp32 \
    --data-dir data/tinystories_demo --steps 50 \
    --batch-size 4 --seq-len 64 --device cpu

For a real training run, point --data-dir at the full TinyStories dataset (or anything else packed as train.bin/valid.bin; see data/tinystories_demo/README.md for the format).
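For orientation, here is a rough sketch of what byte-level packing can look like. The authoritative format is the one described in data/tinystories_demo/README.md and produced by scripts/prepare_tinystories.py. This sketch assumes the simplest possible layout (raw UTF-8 bytes, one byte per token, no header), which may not match the kit. `encode_bytes` and `pack_corpus` are hypothetical names.

```python
import tempfile
from pathlib import Path


def encode_bytes(text):
    """Byte-level 'tokenisation': each UTF-8 byte is one token id in 0..255."""
    return list(text.encode("utf-8"))


def pack_corpus(text, out_dir, valid_fraction=0.1):
    """Write train.bin / valid.bin as raw bytes (ASSUMED layout, no header).

    Returns (train_bytes, valid_bytes). The split is made at a byte offset,
    so a multi-byte character may straddle it; a byte-level model does not
    need character boundaries.
    """
    data = text.encode("utf-8")
    n_valid = int(len(data) * valid_fraction)
    split = len(data) - n_valid
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "train.bin").write_bytes(data[:split])
    (out / "valid.bin").write_bytes(data[split:])
    return split, n_valid


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sizes = pack_corpus("Once upon a time, there was a little girl.", tmp)
        print("train/valid bytes:", sizes)
```

The default 10 % validation share roughly mirrors the ~700 KB / ~70 KB demo slice. Check the demo README before training on files packed this way.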

Available --model configs

| Name | Builder | Quantize | Shape | Param-count |
|---|---|---|---|---|
| tilelli-lite-fp32 | Lite 3-pathway | FP32 | d=256, L=8 | ~10 M |
| tilelli-lite-ternary | Lite 3-pathway | Ternary STE | d=256, L=8 | ~10 M |
| tilelli-fp32 | Parent multi-pathway | FP32 | d=512, L=7 | ~10 M |
| tilelli-ternary | Parent multi-pathway | Ternary STE | d=512, L=7 | ~10 M |
| vanilla-fp32 | Pre-norm transformer baseline | FP32 | d=320, L=8 | ~10 M |

Add your own variants by editing MODEL_CFGS in scripts/train.py.
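When sizing a new variant, a back-of-envelope parameter count helps keep the comparison param-fair. The sketch below checks the ~10 M figure for vanilla-fp32 only. It assumes a standard layout (four d×d attention projections, a 4× FFN expansion, a 256-entry byte embedding tied with the output head, biases and norms ignored). The actual layout lives in src/tilelli/baselines/vanilla.py and may differ, and `vanilla_param_count` is a hypothetical helper.

```python
def vanilla_param_count(d_model, n_layers, vocab_size=256, ffn_mult=4):
    """Rough weight count for a pre-norm transformer (biases and norms ignored)."""
    attention = 4 * d_model * d_model          # Q, K, V and output projections
    ffn = 2 * ffn_mult * d_model * d_model     # up- and down-projection
    embedding = vocab_size * d_model           # byte vocabulary, assumed tied head
    return n_layers * (attention + ffn) + embedding


params = vanilla_param_count(d_model=320, n_layers=8)
print(f"{params:,} parameters")                # 9,912,320 parameters
print(f"{params * 4 / 1e6:.1f} MB in FP32")    # 39.6 MB in FP32
```

Under these assumptions d=320, L=8 lands just under 10 M weights, and 4 bytes per weight gives roughly the 39 MB size listed for the checkpoints. The routed Tilelli blocks have a different per-layer layout, so this formula does not apply to them.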


A note on the two checkpoints

Tilelli ships two trained models because those are the two we currently have, and they are not the same architecture. To be plain about it:

  • tilelli_chat_v4.pt is the deployed chat model that lives at chat.tilelli.tech. It runs the Lite 3-pathway block (local conv + sparse top-k attention + dense FFN, d=256, L=8). It's FP32 because we haven't yet had GPU budget to do a ternary-aware re-training of the chat SFT.

  • tilelli_pretrain_v1_ternary.pt is a 50K-step plain ternary pretrain on TinyStories using the parent multi-pathway block (5-pathway, d=512, L=7). It's not chat-SFT'd, so it produces TinyStories-style continuations rather than answering questions. It demonstrates that the ternary recipe in this kit actually converges to coherent text (val loss 0.6843 on TinyStories byte-LM).

A future ternary-aware re-training of the Lite architecture would give you the same checkpoint twice (FP32 and ternary), which is the artifact we actually want. It's queued.
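For readers new to the "ternary, STE throughout" wording, the following dependency-free sketch shows the idea on a single linear neuron. It is not the kit's quantizer. The absmean scale, the function names `ternarize` and `ste_step`, and the toy loss are all assumptions made for illustration; the real primitives are in src/tilelli/core/.

```python
def ternarize(weights):
    """Quantise to codes in {-1, 0, +1} with an absmean scale (assumed scheme)."""
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def ste_step(latent, x, target, lr=0.05):
    """One SGD step on loss = (y - target)^2 with y = sum(scale * code * x).

    Forward uses the ternary weights. Backward treats quantisation as the
    identity (and the scale as a constant), so the gradient with respect to
    the quantised weight is applied directly to the full-precision latent
    weight. That substitution is the straight-through estimator.
    """
    codes, scale = ternarize(latent)
    y = sum(scale * c * xi for c, xi in zip(codes, x))
    loss = (y - target) ** 2
    grad_wq = [2.0 * (y - target) * xi for xi in x]              # dL/d(quantised weight)
    new_latent = [w - lr * g for w, g in zip(latent, grad_wq)]   # STE: reuse it for the latent
    return new_latent, loss


latent = [0.9, -0.05, -1.2]
codes, scale = ternarize(latent)
print(codes)   # [1, 0, -1]
latent, loss = ste_step(latent, x=[1.0, 2.0, -1.0], target=0.5)
```

The latent weights move on every step even when the ternary codes do not, and a code only flips once its latent weight crosses a rounding threshold. That is why the optimizer keeps full-precision latent weights during training.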


What works (verified)

| # | Claim | Script | Result file |
|---|---|---|---|
| 1 | Held-out IDK gate: 9 / 10 prompts trigger the abstain template (script PASS gate: ≥ 9; verified on bundled v4) | reproduce/03_abstain_held_out.py | results/claim_03_abstain.md |
| 2 | False-inability probe on the bundled set: 7 / 20 trigger refusal | reproduce/04_neo_false_inability.py | results/claim_04_neo.md |
| 3 | Cross-regime ID-vs-OOD AUROC ≈ chance for all 4 signals (max_softmax_mean ≈ 0.54); this is the table the script computes and gates on. Broken down per regime, max_softmax_mean reaches AUROC ≈ 0.93 on gibberish-vs-in-domain (the one working slice; documented in the result file, not recomputed by this script). | reproduce/02_metacog_probe.py | results/claim_02_metacog.md |
| 4 | Architecture + checkpoints + trainer work end-to-end on CPU | reproduce/01_benchmark.py + pytest tests/ | (none) |
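To make "AUROC ≈ chance" concrete, here is a minimal pairwise AUROC. This is a sketch of the standard definition, not the scorer in src/tilelli/eval/. The name `auroc` is hypothetical, and treating higher scores as "more OOD" is only a convention chosen for the example.

```python
def auroc(positive_scores, negative_scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. 0.5 means the score carries no ranking
    information (chance); 1.0 means perfect separation.
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


print(auroc([0.9, 0.8, 0.7], [0.1, 0.2, 0.3]))   # 1.0
print(auroc([0.2, 0.8], [0.3, 0.7]))             # 0.5
```

Read this way, a cross-regime value near 0.54 means the signal ranks an OOD prompt above an in-domain one only slightly more often than a coin flip would.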

What doesn't work (verified negative)

| # | Claim that's wrong | What the evidence actually shows |
|---|---|---|
| N1 | "Router-entropy is an architecture-native metacognition signal" | Across 7 OOD regimes × n=30, router-entropy family wins 0 / 7 vs max_softmax_mean. |
| N2 | "Lite beats vanilla 3 / 3 seeds at param-fair" | 3 Lite seeds vs 1 vanilla seed (we ran out of RunPod budget). Welch test pending a 3-seed vanilla replication. The 6.7σ figure was retracted. |
| N3 | "Train an abstain head once, splice it onto any base model" | v7's joint-trained abstain head got AUROC 0.76 cross-regime; lifted onto v4's base it dropped to 0.54 with 27 % false-positive rate. Not modular. |
| N4 | "Just turn off the metacog loss and the router will be left alone" | CE on in-domain still backprops through unfrozen router-Linears. 16K updates shift routing distribution → OOD generation collapses. |

Reproducing claims

python reproduce/01_benchmark.py            # arch loads, ~10M params (CPU, ~2 s)
python reproduce/03_abstain_held_out.py     # 9 / 10 held-out IDK gate (CPU, ~1 min)
python reproduce/04_neo_false_inability.py  # 7 / 20 false-inability rate (CPU, ~2 min)
python reproduce/02_metacog_probe.py        # cross-regime AUROC sweep (CPU, ~15 min; slow)

Each script exits non-zero if the bundled v4 checkpoint fails to produce the documented number within 5 %. If a script doesn't reproduce its claim on your machine, please open an issue.
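The 5 % rule can be pictured as a relative-tolerance gate that maps to a process exit code. The sketch below is illustrative only. `within_tolerance` and `gate` are hypothetical names, the observed value is made up, and the bundled scripts may gate differently (claim 1, for instance, uses a ≥ 9 threshold).

```python
def within_tolerance(observed, documented, rel_tol=0.05):
    """True if observed is within rel_tol (5 % by default) of documented."""
    return abs(observed - documented) <= rel_tol * abs(documented)


def gate(name, observed, documented, rel_tol=0.05):
    """Print a PASS/FAIL line and return a process exit code (0 = pass)."""
    ok = within_tolerance(observed, documented, rel_tol)
    status = "PASS" if ok else "FAIL"
    print(f"{name}: observed={observed} documented={documented} {status}")
    return 0 if ok else 1


# Hypothetical observed value checked against the documented 0.54.
exit_code = gate("max_softmax_mean AUROC", observed=0.55, documented=0.54)
# A reproduce-style script would then finish with: raise SystemExit(exit_code)
```

A non-zero exit code lets shell pipelines and CI jobs fail automatically when a documented number stops reproducing.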

What's in this repo

| Path | What it is |
|---|---|
| src/tilelli/core/ | The architecture: 8 .py files, Lite + parent variants, ternary primitives, hadamard, sparse attention, SSM |
| src/tilelli/baselines/vanilla.py | The pre-norm transformer used for the A/B comparison |
| src/tilelli/optimisers/ | AdamW wrapper + Muon optimizer support |
| src/tilelli/eval/ | Metacog probe + scorer (verifies claim 02) |
| scripts/train.py | Master trainer: --model {tilelli-lite-fp32, tilelli-lite-ternary, vanilla-fp32, tilelli-fp32, tilelli-ternary} |
| scripts/train_demo.py | 5-step CPU smoke; verifies the gradient flows |
| scripts/prepare_tinystories.py | Packs raw TinyStories txt → train.bin/valid.bin |
| chat.py, infer.py | Inference entry points (chat uses v4 + KV cache; infer auto-routes) |
| checkpoints/ | The two ckpts above |
| data/tinystories_demo/ | ~700 KB train + ~70 KB valid demo slice (TinyStories CC-BY-4.0) |
| reproduce/ | Four claim-verification scripts |
| results/ | Verified claim docs + audit trail |
| prompts/probe_210.jsonl | 210-prompt evaluation set across 7 regimes |
| tests/test_kit_smoke.py | Three smoke tests (pytest -q tests/) |

What's NOT in this repo

  • Spectrum (power-of-3 7-level quantization): separate research line in the source repo's mosaic/spinoffs/spectrum/. Closes ~49 % of the ternary→FP32 gap but is still ~12 % behind vanilla FP32. Out of scope here.
  • The FineWeb-Edu training pipeline + the SFT data that produced v4: private. The minimal training loop bundled here trains on any .bin shards you provide.
  • The failed metacog ckpts (v5 / v6 / v7 / v8a / v8b / splice): available on request via hello@tilelli.tech for negative-result replication.

The actual interesting finding

In a small (10 M-param) routed LM, the metacognition / uncertainty signal does not live in a separable module. We trained 5 variants (v5–v8b) sweeping the metacog-loss weight from 20 → 0, plus a splice (head-only graft). The best signal (cross-regime ID-vs-OOD AUROC 0.85 on abstain_p) is reached without any explicit metacog loss (v8b, BCE-only), but at the cost of generation quality. The head-only splice preserves generation, but the signal collapses (AUROC 0.76 → 0.54).

The signal IS reachable. The module is not liftable. See PAPER_OUTLINE.md for the workshop write-up.

License

Apache 2.0. See LICENSE. The bundled weights and the TinyStories demo slice ship under the same license (TinyStories is CC-BY-4.0; both licenses permit redistribution). The "Tilelli" name is not licensed by this file: fork freely, but rename if you ship a derivative product.
