Complete Suite of BPE Tokenizers for TinyStories

This repository contains a full suite of custom BPE tokenizers trained from scratch on 1,000,000 stories (about 200M words) from the roneneldan/TinyStories dataset.

Optimizing vocabulary size is critical when training Small Language Models (SLMs) on resource-constrained hardware like Mac M1 (8GB RAM). This benchmark provides data-driven evidence for choosing the right vocabulary size.

Benchmark Results (Tested on 5,000 Validation Stories)

Vocab Size Total Tokens Compression Ratio Avg Tokens/Story Min Tokens/Story Max Tokens/Story Speed (Tokens/sec)
1024 2,700,793 3.07 270.1 15 1399 ~554,000
1536 2,420,572 3.42 242.1 15 1288 ~508,000
2048 2,284,150 3.63 228.4 15 1243 ~511,000
2560 2,205,438 3.76 220.5 15 1224 ~483,000
3072 2,154,156 3.85 215.4 15 1135 ~480,000
4096 2,086,877 3.97 208.7 15 1117 ~468,000
8192 2,001,284 4.14 200.1 15 1079 ~444,000
16384 1,984,990 4.17 198.5 15 1067 ~443,000
50257 1,980,834 4.18 198.1 15 1067 ~354,000

Key Insights

  1. The 1536-2048 Sweet Spot: Moving from 1024 to 1536/2048 gives the sharpest increase in compression efficiency (3.07 → 3.63). For lightweight models, this range offers the best trade-off between sequence compression and embedding matrix size.
  2. The 4096 Diminishing Returns: A vocabulary size of 4096 captures almost all structure needed for TinyStories (compression ratio 3.97).
  3. The 50257 Overkill: The standard GPT-2 vocabulary (50,257) is massive overkill. Increasing vocabulary size by 12x (compared to 4096) saves a mere 10 tokens per story on average. For TinyStories, larger vocabularies just bloat the model's weights without giving any real context-window benefits.

How to use

Since all files are stored in the root directory, you can load any specific tokenizer directly by specifying the file name:

from tokenizers import Tokenizer

# Load the optimal 2048 tokenizer
tokenizer = Tokenizer.from_pretrained(
    "morginalium/tinystories-tokenizers", 
    filename="tokenizer_2048.json"
)

encoded = tokenizer.encode("Once upon a time, there was a little pup.")
print(encoded.ids)
Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Dataset used to train morginalium/tinystories-tokenizers