Tilelli

Working with this repo through an AI agent (Cursor / Claude Code / Codex / Aider / ChatGPT)? Read AGENTS.md first. It has the install path, the verified claims, the verified negative claims (so the agent doesn't repeat them as facts), and the common mistakes other agents have already made on this kit.

A small (~10 M-parameter) byte-level language model with a 3-pathway routed block. It trains in either FP32 or ternary mode and runs inference out of the box, all on CPU; the bundled chat checkpoint is FP32, and the bundled ternary checkpoint does story continuation. Part of a family of ternary-first language models (Mosaic, atome-lm, spectrum) that share the same intent: small, local, ternary-capable, auditable end-to-end.

This kit ships:

The architecture in 8 source files (3-pathway + parent multi-pathway)
Two trained checkpoints — FP32 chat (deployed) and plain ternary pretrain
A working trainer that takes a text corpus and a --model flag
A ~700 KB demo training dataset (TinyStories slice) so train.py runs end-to-end on CPU in a few minutes
Four verification scripts that exit non-zero if our documented numbers don't reproduce against the bundled v4 ckpt

What's in `checkpoints/`

File	Size	Precision	Architecture	Training	Use it for
`tilelli_chat_v4.pt`	39 MB	FP32	3-pathway Lite (d=256, L=8)	12K-step FineWeb-Edu pretrain → chat SFT → abstain-aware SFT	Chat. Deployed at chat.tilelli.tech. SHA `9f1dcc9465003a…`
`tilelli_pretrain_v1_ternary.pt`	39 MB	Ternary {−1, 0, +1}, STE throughout	Parent multi-pathway (d=512, L=7)	50K-step TinyStories pretrain	Story continuation. Base for your own ternary SFT. SHA `e1b0a263b5c2…`

Both are ~10 M-parameter byte-level models, but they use different architectural variants of the same family; see "A note on the two checkpoints" below.
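The SHA values in the table are truncated, so only a prefix check is possible from this README. The sketch below is one way to do that check. It assumes the digests are SHA-256, which this document does not state, and the helper names are hypothetical and not part of the kit.

```python
import hashlib
from pathlib import Path


def file_digest_hex(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so a 39 MB checkpoint never sits fully in memory."""
    digest = hashlib.new(algorithm)
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_prefix(path, published_prefix, algorithm="sha256"):
    """Compare against a truncated digest; a prefix match is weaker than a full match."""
    return file_digest_hex(path, algorithm).startswith(published_prefix.lower())


if __name__ == "__main__":
    # Prefix copied from the table above; the hash algorithm is an assumption.
    ckpt = Path("checkpoints/tilelli_chat_v4.pt")
    if ckpt.exists():
        print(matches_published_prefix(ckpt, "9f1dcc9465003a"))
```

If the prefix does not match, try another algorithm name accepted by `hashlib.new` before concluding the file is corrupt.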

Install (CPU, ~120 MB total)

git clone https://github.com/TilelliLab/Tilelli-llm
cd Tilelli-llm
# CPU-only torch (avoids 2 GB CUDA wheel on Linux):
pip install --index-url https://download.pytorch.org/whl/cpu torch
pip install -e .

See INSTALL.md for macOS / Windows / GPU notes.

Chat

# Talk to the deployed FP32 chat model:
python chat.py "What is the moon?"
# → "i can't answer that. facts like that are beyond a 10m model"

# Or use the generic inference script with either ckpt:
python infer.py --prompt "Hello, who are you?"
# → uses checkpoints/tilelli_chat_v4.pt by default

# Story continuation with the ternary pretrain:
python infer.py --ckpt checkpoints/tilelli_pretrain_v1_ternary.pt \
                --prompt "Once upon a time, there was a little"
# → "girl named Lily. She loved to play outside in the snow. One day…"

Train your own — FP32 or ternary

The kit ships a small TinyStories slice at data/tinystories_demo/ so you can do a smoke training run immediately:

# FP32, 50 steps on CPU, takes a couple of minutes:
python scripts/train.py --model tilelli-lite-fp32 \
    --data-dir data/tinystories_demo --steps 50 \
    --batch-size 4 --seq-len 64 --device cpu

# Same architecture, ternary forward pass (straight-through estimator):
python scripts/train.py --model tilelli-lite-ternary \
    --data-dir data/tinystories_demo --steps 50 \
    --batch-size 4 --seq-len 64 --device cpu

# Vanilla GPT baseline for A/B comparison:
python scripts/train.py --model vanilla-fp32 \
    --data-dir data/tinystories_demo --steps 50 \
    --batch-size 4 --seq-len 64 --device cpu

For a real training run, point --data-dir at the full TinyStories dataset (or anything else packed as train.bin/valid.bin; see data/tinystories_demo/README.md for the format).
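The authoritative description of `train.bin`/`valid.bin` is the README in data/tinystories_demo/ and is not reproduced here. As a rough illustration of what byte-level training data looks like, the sketch below assumes one unsigned byte per token (a natural fit for a byte-level model, but an assumption) and draws next-byte prediction windows. The function names are hypothetical.

```python
import random


def encode_bytes(text):
    """Byte-level 'tokenizer': the UTF-8 bytes are the token ids (0..255)."""
    return text.encode("utf-8")


def sample_batch(packed, batch_size, seq_len, rng):
    """Draw random windows; targets are the inputs shifted left by one byte."""
    if len(packed) < seq_len + 1:
        raise ValueError("corpus is shorter than seq_len + 1 bytes")
    inputs, targets = [], []
    for _ in range(batch_size):
        start = rng.randrange(len(packed) - seq_len)
        window = packed[start : start + seq_len + 1]
        inputs.append(list(window[:-1]))
        targets.append(list(window[1:]))
    return inputs, targets


if __name__ == "__main__":
    corpus = encode_bytes("Once upon a time, there was a little girl named Lily. " * 4)
    x, y = sample_batch(corpus, batch_size=4, seq_len=64, rng=random.Random(0))
    print(len(x), len(x[0]), bytes(x[0][:16]))
```

The `batch_size=4, seq_len=64` values mirror the smoke-run flags above; no vocabulary file is needed because every byte value is already a valid token id.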

Available `--model` configs

Name	Builder	Quantize	Shape	Param-count
`tilelli-lite-fp32`	Lite 3-pathway	FP32	d=256, L=8	~10 M
`tilelli-lite-ternary`	Lite 3-pathway	Ternary STE	d=256, L=8	~10 M
`tilelli-fp32`	Parent multi-pathway	FP32	d=512, L=7	~10 M
`tilelli-ternary`	Parent multi-pathway	Ternary STE	d=512, L=7	~10 M
`vanilla-fp32`	Pre-norm transformer baseline	FP32	d=320, L=8	~10 M

Add your own variants by editing MODEL_CFGS in scripts/train.py.
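As a sanity check on the "~10 M" column, the `vanilla-fp32` row can be estimated by hand. The sketch assumes a textbook transformer layout: four d×d attention projections, a feed-forward block with 4× expansion, a 256-entry byte vocabulary with tied embeddings, and no biases or norm weights. None of these details are confirmed by this README, and the routed Tilelli variants are not estimated because their pathway shapes are not given here.

```python
def vanilla_param_estimate(d_model, n_layers, vocab=256, ffn_mult=4):
    """Rough weight count for a textbook transformer (biases and norms ignored)."""
    attention = 4 * d_model * d_model          # Q, K, V and output projections
    ffn = 2 * ffn_mult * d_model * d_model     # up- and down-projection
    embedding = vocab * d_model                # assumed tied with the output head
    return n_layers * (attention + ffn) + embedding


def fp32_megabytes(n_params):
    """Storage at 4 bytes per weight, in decimal megabytes."""
    return n_params * 4 / 1e6


if __name__ == "__main__":
    n = vanilla_param_estimate(d_model=320, n_layers=8)
    print(n, round(fp32_megabytes(n), 1))  # prints: 9912320 39.6
```

Under these assumptions d=320, L=8 lands at about 9.9 M weights. Ten million FP32 weights occupy 40 MB, which is in the same range as the 39 MB checkpoint files listed earlier.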

A note on the two checkpoints

Tilelli ships two trained models because those are the two we currently have, and they do not share an architecture. To be plain about it:

tilelli_chat_v4.pt is the deployed chat model that lives at chat.tilelli.tech. It runs the Lite 3-pathway block (local conv + sparse top-k attention + dense FFN, d=256, L=8). It's FP32 because we haven't yet had GPU budget to do a ternary-aware re-training of the chat SFT.
tilelli_pretrain_v1_ternary.pt is a 50K-step plain ternary pretrain on TinyStories using the parent multi-pathway block (5-pathway, d=512, L=7). It's not chat-SFT'd, so it produces TinyStories-style continuations rather than answering questions. It demonstrates that the ternary recipe in this kit actually converges to coherent text (val loss 0.6843 on TinyStories byte-LM).
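For readers new to "ternary with a straight-through estimator (STE)", here is the general idea in dependency-free Python. The forward pass uses weights rounded to {−1, 0, +1} times a shared scale, and the backward pass applies the gradient to the latent float weights as if the rounding were the identity. The absmean scale is one common choice and an assumption here; the kit's ternary primitives in src/tilelli/core/ may use a different scale or threshold.

```python
def ternarize(latent):
    """Round latent float weights to codes in {-1, 0, +1} with one shared scale."""
    scale = sum(abs(w) for w in latent) / len(latent) or 1.0  # absmean; 1.0 if all zero
    codes = [max(-1, min(1, round(w / scale))) for w in latent]
    return codes, scale


def ste_sgd_step(latent, x, target, lr=0.05):
    """One SGD step on 0.5 * (w_q . x - target)**2 using the straight-through estimator."""
    codes, scale = ternarize(latent)
    w_q = [c * scale for c in codes]             # weights used in the forward pass
    error = sum(w * xi for w, xi in zip(w_q, x)) - target
    grad_wrt_w_q = [error * xi for xi in x]      # exact gradient w.r.t. w_q
    # STE: treat d(w_q)/d(latent) as 1 (scale held constant), so the same
    # gradient updates the latent float weights.
    new_latent = [w - lr * g for w, g in zip(latent, grad_wrt_w_q)]
    return new_latent, 0.5 * error * error


if __name__ == "__main__":
    weights = [0.9, -0.05, -1.2, 0.4]
    print(ternarize(weights))
    weights, loss = ste_sgd_step(weights, x=[1.0, 2.0, -1.0, 0.5], target=1.0)
    print(weights, loss)
```

Note that the second weight is rounded to 0 in the forward pass yet still receives a gradient, which is what lets a zeroed weight re-enter the network later in training.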

A future ternary-aware re-training of the Lite architecture would give a matched pair of checkpoints, one FP32 and one ternary, on the same architecture, which is the artifact we actually want. It's queued.

What works (verified)

#	Claim	Script	Result file
1	Held-out IDK gate: 9 / 10 prompts trigger the abstain template (script PASS gate: ≥ 9 — verified on bundled v4)	`reproduce/03_abstain_held_out.py`	`results/claim_03_abstain.md`
2	False-inability probe on the bundled set: 7 / 20 trigger refusal	`reproduce/04_neo_false_inability.py`	`results/claim_04_neo.md`
3	Cross-regime ID-vs-OOD AUROC ≈ chance for all 4 signals (`max_softmax_mean` ≈ 0.54) — this is the table the script computes and gates on. Broken down per regime, `max_softmax_mean` reaches AUROC ≈ 0.93 on gibberish-vs-in-domain (the one working slice; documented in the result file, not recomputed by this script).	`reproduce/02_metacog_probe.py`	`results/claim_02_metacog.md`
4	Architecture + checkpoints + trainer work end-to-end on CPU	`reproduce/01_benchmark.py` + `pytest tests/`	—
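To make the AUROC figures in claim 3 concrete, the sketch below shows one way such a number can be computed from per-prompt scores. The definition of `max_softmax_mean` used here (the mean over positions of the largest next-byte probability) is a reading of the name, not something this README confirms. The scores are toy values, and the convention that higher means in-domain is also an assumption.

```python
import math


def max_softmax_mean(logits_per_position):
    """Mean over positions of the largest softmax probability (assumed definition)."""
    peaks = []
    for logits in logits_per_position:
        top = max(logits)
        exps = [math.exp(v - top) for v in logits]  # shift for numerical stability
        peaks.append(1.0 / sum(exps))               # exp(top - top) / sum = max prob
    return sum(peaks) / len(peaks)


def auroc(id_scores, ood_scores):
    """P(score_ID > score_OOD) with ties counted as one half; 0.5 is chance."""
    wins = 0.0
    for a in id_scores:
        for b in ood_scores:
            wins += 1.0 if a > b else 0.5 if a == b else 0.0
    return wins / (len(id_scores) * len(ood_scores))


if __name__ == "__main__":
    # Toy scores, not measurements from the kit.
    print(auroc([0.9, 0.8, 0.7], [0.6, 0.75, 0.2]))
```

The pairwise form is quadratic in the number of prompts, which is fine for a probe set of a few hundred; it equals the Mann-Whitney U statistic divided by the number of pairs.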

What doesn't work (verified negative)

#	Claim that's wrong	What the evidence actually shows
N1	"Router-entropy is an architecture-native metacognition signal"	Across 7 OOD regimes × n=30, router-entropy family wins 0 / 7 vs `max_softmax_mean`.
N2	"Lite beats vanilla 3 / 3 seeds at param-fair"	3 Lite seeds vs 1 vanilla seed (we ran out of RunPod budget). Welch test pending a 3-seed vanilla replication. The 6.7σ figure was retracted.
N3	"Train an abstain head once, splice it onto any base model"	v7's joint-trained abstain head got AUROC 0.76 cross-regime; lifted onto v4's base it dropped to 0.54 with 27 % false-positive rate. Not modular.
N4	"Just turn off the metacog loss and the router will be left alone"	CE on in-domain still backprops through unfrozen router-Linears. 16K updates shift routing distribution → OOD generation collapses.
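For context on N1, a "router-entropy" signal can be pictured as the Shannon entropy of each token's routing distribution over the pathways, averaged over a prompt. That definition is an assumption for illustration; the probe in src/tilelli/eval/ may define the router-entropy family differently, and the logits below are made up.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def router_entropy(router_logits):
    """Shannon entropy (nats) of one token's routing distribution over pathways."""
    probs = softmax(router_logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_router_entropy(logits_per_token):
    """Average entropy across the tokens of a prompt."""
    return sum(router_entropy(row) for row in logits_per_token) / len(logits_per_token)


if __name__ == "__main__":
    # Three pathways, as in the Lite block; the logits are invented.
    print(mean_router_entropy([[2.0, 0.1, -1.0], [0.0, 0.0, 0.0]]))
```

With three pathways the value ranges from 0 (all mass on one pathway) to ln 3 (uniform routing).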

Reproducing claims

python reproduce/01_benchmark.py            # arch loads, ~10M params (CPU, ~2 s)
python reproduce/03_abstain_held_out.py     # 9 / 10 held-out IDK gate (CPU, ~1 min)
python reproduce/04_neo_false_inability.py  # 7 / 20 false-inability rate (CPU, ~2 min)
python reproduce/02_metacog_probe.py        # cross-regime AUROC sweep (CPU, ~15 min — slow)

Each script exits non-zero if the bundled v4 checkpoint fails to produce the documented number within 5 %. If a script doesn't reproduce its claim on your machine, please open an issue.
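The exit-code contract can be pictured as a small tolerance gate. This is a sketch under two assumptions: "within 5 %" is read as a relative tolerance, and the helper names are hypothetical. The real scripts in reproduce/ may gate differently; claim 1, for example, is documented with a "≥ 9" threshold.

```python
def within_tolerance(measured, documented, rel_tol=0.05):
    """True when measured is within rel_tol (relative) of the documented value."""
    if documented == 0:
        return measured == 0
    return abs(measured - documented) <= rel_tol * abs(documented)


def gate(name, measured, documented, rel_tol=0.05):
    """Return a process exit code: 0 on PASS, 1 on FAIL."""
    ok = within_tolerance(measured, documented, rel_tol)
    status = "PASS" if ok else "FAIL"
    print(f"{name}: measured={measured} documented={documented} -> {status}")
    return 0 if ok else 1


if __name__ == "__main__":
    # Illustrative numbers only; 0.54 is the documented cross-regime AUROC.
    code = gate("max_softmax_mean AUROC", measured=0.55, documented=0.54)
    print("exit code would be", code)
```

A script built this way would end with `sys.exit(code)`, so a CI job or a shell `&&` chain stops at the first claim that fails to reproduce.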

What's in this repo

Path	What it is
`src/tilelli/core/`	The architecture — 8 .py files, Lite + parent variants, ternary primitives, hadamard, sparse attention, SSM
`src/tilelli/baselines/vanilla.py`	The pre-norm transformer used for the A/B comparison
`src/tilelli/optimisers/`	AdamW wrapper + Muon optimizer support
`src/tilelli/eval/`	Metacog probe + scorer (verifies claim 02)
`scripts/train.py`	Master trainer — `--model {tilelli-lite-fp32, tilelli-lite-ternary, vanilla-fp32, tilelli-fp32, tilelli-ternary}`
`scripts/train_demo.py`	5-step CPU smoke test; verifies that gradients flow
`scripts/prepare_tinystories.py`	Packs raw TinyStories txt → `train.bin`/`valid.bin`
`chat.py`, `infer.py`	Inference entry points (chat uses v4 + KV cache; infer auto-routes)
`checkpoints/`	The two ckpts above
`data/tinystories_demo/`	~700 KB train + ~70 KB valid demo slice (TinyStories CC-BY-4.0)
`reproduce/`	Four claim-verification scripts
`results/`	Verified claim docs + audit trail
`prompts/probe_210.jsonl`	210-prompt evaluation set across 7 regimes
`tests/test_kit_smoke.py`	Three smoke tests (`pytest -q tests/`)

What's NOT in this repo

Spectrum (power-of-3 7-level quantization) — separate research line in the source repo's mosaic/spinoffs/spectrum/. Closes ~49 % of the ternary→FP32 gap but is still ~12 % behind vanilla FP32. Out of scope here.
The FineWeb-Edu training pipeline + the SFT data that produced v4 — private. The minimal training loop bundled here trains on any .bin shards you provide.
The failed metacog ckpts (v5 / v6 / v7 / v8a / v8b / splice) — available on request via hello@tilelli.tech for negative-result replication.

The actual interesting finding

In a small (10 M-param) routed LM, the metacognition / uncertainty signal does not live in a separable module. We trained 5 variants (v5–v8b) sweeping the metacog-loss weight from 20 → 0, plus a splice (head-only graft). The best signal (cross-regime ID-vs-OOD AUROC 0.85 on abstain_p) is reached without any explicit metacog loss (v8b, BCE-only) — but at the cost of generation quality. The head-only splice preserves generation but the signal collapses (AUROC 0.76 → 0.54).

The signal IS reachable. The module is not liftable. See PAPER_OUTLINE.md for the workshop write-up.

License

Apache 2.0. See LICENSE. The bundled weights and the TinyStories demo slice ship under the same license (TinyStories is CC-BY-4.0; both licenses permit redistribution). The "Tilelli" name is not licensed by this file — fork freely; rename if you ship a derivative product.
