TOK-16K Tokenizer

Overview

TOK-16K is a byte-level BPE tokenizer with a 16,384-token vocabulary, designed for multilingual reasoning, arithmetic, structured data representation, and code-aware language modeling. It is optimized for small transformer architectures (~1M–10M parameters) and is intended for use in curriculum-trained language models.

The tokenizer is designed to be stable, extensible, and reusable across both small-scale and future large-scale models.


Key Design Goals

  • Compact vocabulary (16,384 tokens) for efficient embedding utilization in low-parameter models
  • Strong support for structured reasoning formats
  • Compatibility with arithmetic, code, and natural language mixtures
  • Future-proof special token design for instruction tuning and reasoning traces
  • Efficient byte-level tokenization for robust handling of unknown or noisy text
  • Stable behavior across model sizes and context windows (from 512 to 4096+ tokens)

Tokenization Algorithm

TOK-16K uses:

  • Byte-Level BPE (Byte Pair Encoding)
  • Unicode normalization (NFKC)
  • Byte-level pre-tokenization
  • Fast decoding via ByteLevel decoder

This ensures:

  • No out-of-vocabulary failures, since any input can be represented as a sequence of bytes
  • Robust handling of code and symbols
  • Stable token boundaries for arithmetic expressions
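
As a rough illustration of the pipeline listed above, the following sketch applies NFKC normalization, splits text into single bytes, and learns a few BPE merges from a toy corpus. It is not the TOK-16K training code: the helper names (`to_byte_symbols`, `learn_merges`, `apply_merge`, `encode`) and the three-line corpus are assumptions made for illustration, and the pre-tokenization step that production byte-level BPE uses to keep merges inside word boundaries is omitted for brevity.

```python
import unicodedata
from collections import Counter


def to_byte_symbols(text: str) -> list:
    """NFKC-normalize the text, then split it into single-byte symbols."""
    normalized = unicodedata.normalize("NFKC", text)
    return [bytes([b]) for b in normalized.encode("utf-8")]


def apply_merge(seq: list, pair: tuple) -> list:
    """Replace every adjacent occurrence of `pair` with one fused symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_merges(corpus: list, num_merges: int) -> list:
    """Greedily learn merges: repeatedly fuse the most frequent adjacent pair."""
    sequences = [to_byte_symbols(text) for text in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in sequences:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        # Break frequency ties on the pair itself so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        sequences = [apply_merge(seq, best) for seq in sequences]
    return merges


def encode(text: str, merges: list) -> list:
    """Encode text by applying the learned merges in the order they were learned."""
    seq = to_byte_symbols(text)
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq


merges = learn_merges(["17 * 19", "17 + 19", "17 - 1"], num_merges=3)
# Fullwidth digits are folded to ASCII digits by NFKC before byte splitting.
print(encode("１７ * 19", merges))
```

Because the base alphabet is the 256 possible byte values, any string can be encoded without an unknown-token fallback, and merges only shorten the sequence. Note that NFKC is lossy: a superscript such as ² is folded to a plain 2, so decoding returns the normalized text rather than the original input.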

Vocabulary Size

  • Total Vocabulary Size: 16,384 tokens
  • The total includes the reserved special tokens listed under Special Tokens

This vocabulary size is chosen to balance:

  • Representation capacity
  • Embedding efficiency
  • Compute constraints for sub-5M parameter models
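
To make the embedding trade-off concrete, the sketch below counts the parameters taken by the token embedding table at a 16,384-token vocabulary. The hidden sizes (64, 128, 256) are assumed values chosen for illustration, not TOK-16K requirements, and `embedding_params` is a hypothetical helper.

```python
VOCAB_SIZE = 16_384  # TOK-16K vocabulary size


def embedding_params(d_model: int, tied: bool = True) -> int:
    """Parameters in the token embedding table(s) for a given hidden size.

    With tied embeddings the input table is reused as the output projection,
    so it is counted once; untied models carry two tables of the same shape.
    """
    table = VOCAB_SIZE * d_model
    return table if tied else 2 * table


# Assumed hidden sizes, for illustration only.
for d_model in (64, 128, 256):
    print(d_model, embedding_params(d_model), embedding_params(d_model, tied=False))
```

At an assumed hidden size of 128, a tied embedding table takes 16,384 × 128 = 2,097,152 parameters, which is already a large share of a sub-5M parameter budget. Untied input and output tables would double that figure.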

Special Tokens

The tokenizer includes the following special tokens:

Core System Tokens

  • [PAD] : Padding token for batch alignment
  • [UNK] : Unknown-token fallback (rarely needed, since byte-level encoding can represent any input)
  • [BOS] : Beginning of sequence
  • [EOS] : End of sequence
  • [MASK] : Masking token (reserved for future training stages)

Reasoning and Structured Tokens

  • <MATH> : Marks mathematical problem statements
  • <EXPR> : Represents mathematical expressions or transformations
  • <ANSWER> : Denotes final answer section
  • <THINK> : Start of a reasoning trace segment (reserved; not used in pretraining)
  • </THINK> : End of a reasoning trace segment
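
The helper below sketches one way to assemble training text from these markers. The layout (marker, a single space, then content) is inferred from the one-line example in the Usage section. The position of `<EXPR>` between the problem and the answer, and the function name `format_math_example`, are assumptions rather than a documented TOK-16K format.

```python
from typing import Optional


def format_math_example(problem: str, answer: str, expr: Optional[str] = None) -> str:
    """Join the sections with their marker tokens, separated by single spaces."""
    parts = ["<MATH>", problem]
    if expr is not None:
        # Assumed placement: the intermediate expression sits before the answer.
        parts += ["<EXPR>", expr]
    parts += ["<ANSWER>", answer]
    return " ".join(parts)


print(format_math_example("17 * 19", "323"))
print(format_math_example("17 * 19", "323", expr="17 * 20 - 17"))
```

Generating every example through a single function like this is one way to keep the formatting consistent across datasets.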

Instruction / Dialogue Tokens

  • <SYSTEM> : System-level instructions
  • <USER> : User input segment
  • <ASSISTANT> : Model output segment
  • <CODE> : Code segment marker
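
A dialogue could be flattened into a single string along the following lines. This is only a sketch: the source does not specify a chat template, so the turn order, the single-space separators, the `add_generation_prompt` flag, and the name `render_dialogue` are all assumptions.

```python
ROLE_TOKENS = {
    "system": "<SYSTEM>",
    "user": "<USER>",
    "assistant": "<ASSISTANT>",
}


def render_dialogue(turns, add_generation_prompt=False):
    """Render (role, content) pairs as '<ROLE> content' segments joined by spaces."""
    parts = []
    for role, content in turns:
        if role not in ROLE_TOKENS:
            raise ValueError(f"unknown role: {role!r}")
        parts.append(f"{ROLE_TOKENS[role]} {content}")
    if add_generation_prompt:
        # Hypothetical convention: end with the assistant marker so the model continues from it.
        parts.append(ROLE_TOKENS["assistant"])
    return " ".join(parts)


print(render_dialogue(
    [("system", "Answer briefly."), ("user", "What is 17 * 19?")],
    add_generation_prompt=True,
))
```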

Extended Future-Proof Tokens

These tokens are reserved for use in future training stages:

  • <SOT> : Start of task
  • <EOT> : End of task

Model Compatibility

TOK-16K is compatible with:

  • Decoder-only Transformer architectures
  • Tied embedding models
  • SwiGLU-based feed-forward networks
  • RMSNorm-based normalization
  • Rotary Position Embeddings (RoPE)

It has been designed specifically for:

  • Sub-5M parameter models
  • Small transformer research models
  • Curriculum-trained reasoning models

Context Length Compatibility

The tokenizer is independent of model context length.

It supports:

  • 512-token models
  • 1024-token models
  • 2048-token models
  • 4096+-token models

Note: Model context length is defined during training and is NOT enforced by the tokenizer.
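
Since the tokenizer does not enforce a context length, the training or inference code has to split long token sequences itself. A minimal sketch, assuming non-overlapping windows and a hypothetical helper name `split_into_windows`:

```python
def split_into_windows(token_ids, context_len):
    """Split a list of token ids into consecutive windows of at most context_len ids."""
    if context_len <= 0:
        raise ValueError("context_len must be positive")
    return [
        token_ids[start:start + context_len]
        for start in range(0, len(token_ids), context_len)
    ]


# Stand-in ids; real ids would come from the tokenizer's input_ids.
ids = list(range(1200))
windows = split_into_windows(ids, context_len=512)
print([len(w) for w in windows])
```

The last window is usually shorter than the others and may need padding before batching.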


Training Data Compatibility

TOK-16K has been trained and validated for use with:

Natural Language Data

  • High-quality curated text datasets
  • Cosmopedia-style synthetic corpora

Code Data

  • Python code blocks
  • Structured programming datasets

Mathematical Data

  • Arithmetic expressions
  • Step-by-step reasoning tasks
  • Structured equation transformations

Mixed Curriculum Data

  • Interleaved language + math + code datasets
  • Instruction-style formatted datasets

Design Philosophy

TOK-16K is built around the principle that:

Reasoning quality is determined more by structured data formatting than by vocabulary size.

Therefore, it emphasizes:

  • Explicit structured tokens for reasoning formats
  • Clear separation between expression, reasoning, and answer
  • Minimal but expressive special token set
  • Consistent formatting across datasets

Technical Specifications

  • Tokenization Method: Byte-Level BPE
  • Normalization: NFKC
  • Pre-tokenizer: ByteLevel
  • Decoder: ByteLevel
  • Vocabulary Size: 16,384
  • Special Tokens: Yes (the 16 tokens listed under Special Tokens)
  • Padding Side: Right
  • Truncation Side: Right
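
The right-side padding and truncation settings can be pictured with the sketch below, which fits a list of token ids to a fixed length. The helper name `fit_to_length` and the pad id of 0 are assumptions for illustration; the actual id of [PAD] should be read from the loaded tokenizer (for example via its `pad_token_id` attribute).

```python
PAD_ID = 0  # hypothetical id for [PAD]; check the real tokenizer before relying on it


def fit_to_length(token_ids, max_length, pad_id=PAD_ID):
    """Right-truncate, then right-pad, returning (ids, attention_mask)."""
    kept = token_ids[:max_length]      # right truncation keeps the start of the sequence
    n_pad = max_length - len(kept)
    ids = kept + [pad_id] * n_pad      # right padding appends [PAD] at the end
    mask = [1] * len(kept) + [0] * n_pad
    return ids, mask


print(fit_to_length([5, 6, 7], max_length=5))
print(fit_to_length([5, 6, 7, 8, 9, 10], max_length=5))
```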

Usage

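from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("GODELEV/TOK-16K")

# Encode a structured arithmetic example
text = "<MATH> 17 * 19 <ANSWER> 323"
tokens = tokenizer(text)

# The encoding is dict-like and includes the token ids under "input_ids"
print(tokens)

# Map the ids back to token strings to see how the text was split
print(tokenizer.convert_ids_to_tokens(tokens["input_ids"]))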