This repository contains a full suite of custom BPE tokenizers trained from scratch on 1,000,000 stories (about 200M words) from the roneneldan/TinyStories dataset.
Choosing the vocabulary size is critical when training Small Language Models (SLMs) on resource-constrained hardware such as a Mac M1 with 8 GB of RAM. This benchmark compares nine vocabulary sizes, from 1,024 to 50,257, to give data-driven evidence for that choice.
| Vocab Size | Total Tokens | Compression Ratio | Avg Tokens/Story | Min Tokens/Story | Max Tokens/Story | Speed (Tokens/sec) |
|---|---|---|---|---|---|---|
| 1024 | 2,700,793 | 3.07 | 270.1 | 15 | 1399 | ~554,000 |
| 1536 | 2,420,572 | 3.42 | 242.1 | 15 | 1288 | ~508,000 |
| 2048 | 2,284,150 | 3.63 | 228.4 | 15 | 1243 | ~511,000 |
| 2560 | 2,205,438 | 3.76 | 220.5 | 15 | 1224 | ~483,000 |
| 3072 | 2,154,156 | 3.85 | 215.4 | 15 | 1135 | ~480,000 |
| 4096 | 2,086,877 | 3.97 | 208.7 | 15 | 1117 | ~468,000 |
| 8192 | 2,001,284 | 4.14 | 200.1 | 15 | 1079 | ~444,000 |
| 16384 | 1,984,990 | 4.17 | 198.5 | 15 | 1067 | ~443,000 |
| 50257 | 1,980,834 | 4.18 | 198.1 | 15 | 1067 | ~354,000 |
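To make the trade-off in the table concrete, the sketch below sets the token savings of each vocabulary size against the size of its embedding matrix. The token totals are copied from the table. The embedding width `D_MODEL = 256` and both helper functions are illustrative assumptions, not values or code from this repository.

```python
# Figures copied from the benchmark table above: vocab size -> total tokens.
TOTAL_TOKENS = {
    1024: 2_700_793,
    1536: 2_420_572,
    2048: 2_284_150,
    2560: 2_205_438,
    3072: 2_154_156,
    4096: 2_086_877,
    8192: 2_001_284,
    16384: 1_984_990,
    50257: 1_980_834,
}

D_MODEL = 256  # hypothetical embedding width, not taken from this repository


def embedding_params(vocab_size: int, d_model: int = D_MODEL) -> int:
    """Parameters in a single (untied) token-embedding matrix."""
    return vocab_size * d_model


def token_savings_vs_baseline(vocab_size: int, baseline: int = 1024) -> float:
    """Fraction of tokens saved relative to the baseline vocabulary."""
    return 1.0 - TOTAL_TOKENS[vocab_size] / TOTAL_TOKENS[baseline]


if __name__ == "__main__":
    for vocab_size in TOTAL_TOKENS:
        print(
            f"{vocab_size:>6} vocab: "
            f"{embedding_params(vocab_size):>10,} embedding params, "
            f"{token_savings_vs_baseline(vocab_size):6.1%} fewer tokens than 1024"
        )
```

Under these assumptions, going from 8192 to 50257 entries shortens the tokenized text by less than one additional percentage point, while the embedding matrix grows roughly sixfold.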
Since all files are stored in the root directory, you can load any specific tokenizer directly by specifying the file name:
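from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer

# Load the optimal 2048 tokenizer.
# Tokenizer.from_pretrained() has no filename argument and only fetches
# tokenizer.json, so download the specific file and load it from disk.
tokenizer_path = hf_hub_download(
    repo_id="morginalium/tinystories-tokenizers",
    filename="tokenizer_2048.json",
)
tokenizer = Tokenizer.from_file(tokenizer_path)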
encoded = tokenizer.encode("Once upon a time, there was a little pup.")
print(encoded.ids)
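Once a tokenizer is loaded, per-story statistics like those in the benchmark table could be gathered along the following lines. This is a minimal sketch, not the script used to produce the table. The `story_stats` helper is hypothetical, and it assumes "Compression Ratio" means characters per token, which the table does not state. It does not measure throughput.

```python
from statistics import mean
from typing import Callable, Iterable, Sequence


def story_stats(
    stories: Iterable[str],
    encode: Callable[[str], Sequence],
) -> dict:
    """Summarize token counts over a collection of stories."""
    stories = list(stories)
    token_counts = [len(encode(story)) for story in stories]
    total_tokens = sum(token_counts)
    total_chars = sum(len(story) for story in stories)
    return {
        "total_tokens": total_tokens,
        # Assumed definition: characters per token.
        "compression_ratio": total_chars / total_tokens,
        "avg_tokens_per_story": mean(token_counts),
        "min_tokens_per_story": min(token_counts),
        "max_tokens_per_story": max(token_counts),
    }


if __name__ == "__main__":
    # Stand-in encoder (one id per whitespace-separated word) so the sketch
    # runs without downloading anything. With a real tokenizer, pass
    # encode=lambda text: tokenizer.encode(text).ids instead.
    def toy_encode(text: str) -> list:
        return list(range(len(text.split())))

    sample = ["Once upon a time.", "The little pup ran home."]
    print(story_stats(sample, toy_encode))
```

Passing the encoder as a callable keeps the helper independent of any one tokenizer file, so the same function can be run once per vocabulary size.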