Complete Suite of BPE Tokenizers for TinyStories

This repository contains a full suite of custom BPE tokenizers trained from scratch on 1,000,000 stories (about 200M words) from the roneneldan/TinyStories dataset.

Optimizing vocabulary size is critical when training Small Language Models (SLMs) on resource-constrained hardware like Mac M1 (8GB RAM). This benchmark provides data-driven evidence for choosing the right vocabulary size.

Benchmark Results (Tested on 5,000 Validation Stories)

Vocab Size	Total Tokens	Compression Ratio	Avg Tokens/Story	Min Tokens/Story	Max Tokens/Story	Speed (Tokens/sec)
1024	2,700,793	3.07	270.1	15	1399	~554,000
1536	2,420,572	3.42	242.1	15	1288	~508,000
2048	2,284,150	3.63	228.4	15	1243	~511,000
2560	2,205,438	3.76	220.5	15	1224	~483,000
3072	2,154,156	3.85	215.4	15	1135	~480,000
4096	2,086,877	3.97	208.7	15	1117	~468,000
8192	2,001,284	4.14	200.1	15	1079	~444,000
16384	1,984,990	4.17	198.5	15	1067	~443,000
50257	1,980,834	4.18	198.1	15	1067	~354,000

Key Insights

The 1536-2048 Sweet Spot: Moving from 1024 to 1536/2048 gives the sharpest increase in compression efficiency (3.07 → 3.63). For lightweight models, this range offers the best trade-off between sequence compression and embedding matrix size.
The 4096 Diminishing Returns: A vocabulary size of 4096 captures almost all structure needed for TinyStories (compression ratio 3.97).
The 50257 Overkill: The standard GPT-2 vocabulary (50,257) is massive overkill. Increasing vocabulary size by 12x (compared to 4096) saves a mere 10 tokens per story on average. For TinyStories, larger vocabularies just bloat the model's weights without giving any real context-window benefits.

How to use

Since all files are stored in the root directory, you can load any specific tokenizer directly by specifying the file name:

from tokenizers import Tokenizer

# Load the optimal 2048 tokenizer
tokenizer = Tokenizer.from_pretrained(
    "morginalium/tinystories-tokenizers", 
    filename="tokenizer_2048.json"
)

encoded = tokenizer.encode("Once upon a time, there was a little pup.")
print(encoded.ids)

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

morginalium
/

tinystories-tokenizers

Complete Suite of BPE Tokenizers for TinyStories

Benchmark Results (Tested on 5,000 Validation Stories)

Key Insights

How to use

Dataset used to train morginalium/tinystories-tokenizers