Atom2.7m

Atom2.7m is a small decoder-only causal language model trained with a general byte-level BPE tokenizer plus arithmetic-specific digit features. The model has 2,738,880 parameters and uses custom code for both the model and the tokenizer path.

The main result is on ArithMark 2.0, a 2,500-example integer-arithmetic continuation benchmark, where Atom2.7m scores 69.24% accuracy. That is above the published scores of two much larger models in the same range, SmolLM2-1.7B at 66.12% and Qwen2.5-0.5B at 63.04%, while using only 2.74M parameters.

The result shows the leverage of domain-specific design. With arithmetic-aware tokenization and digit features, Atom2.7m reaches the same ArithMark score band as models hundreds of times larger.
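To make "hundreds of times larger" concrete, the short sketch below computes the size ratios. The baseline parameter counts are an assumption: they are the nominal sizes read off the model names, not counts measured from the checkpoints.

```python
# Atom2.7m's parameter count is taken from this model card.
ATOM_PARAMS = 2_738_880

# Assumption: nominal sizes implied by the model names, not measured counts.
BASELINES = {
    "SmolLM2-1.7B": 1_700_000_000,
    "Qwen2.5-0.5B": 500_000_000,
}


def size_ratios(small: int, large: dict[str, int]) -> dict[str, float]:
    """Return how many times larger each baseline is than the small model."""
    return {name: count / small for name, count in large.items()}


ratios = size_ratios(ATOM_PARAMS, BASELINES)
for name, ratio in ratios.items():
    print(f"{name}: about {ratio:.0f}x the parameters of Atom2.7m")
```

Under these nominal sizes the ratios come out at roughly 620x and 180x.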

Model Details

Architecture: decoder-only GPT
Parameters: 2,738,880
Layers: 5
Hidden size: 192
Attention heads: 4
KV heads: 2
Attention: grouped-query causal self-attention with RoPE and XSA projection
Context length: 512
Vocabulary size: 4,096
Token embeddings: tied input/output embeddings
Arithmetic feature embeddings:
- place_vocab_size: 66
- role_vocab_size: 12
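
The attention numbers above imply a few derived shapes. The sketch below works them out under the common convention that the head dimension equals hidden size divided by the number of attention heads; that convention is an assumption here, and the function name gqa_shapes is hypothetical, not part of the model code.

```python
def gqa_shapes(hidden_size: int, num_heads: int, num_kv_heads: int) -> dict[str, int]:
    """Derive grouped-query attention shapes (assumes head_dim = hidden_size / num_heads)."""
    if hidden_size % num_heads != 0:
        raise ValueError("hidden_size must be divisible by num_heads")
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be divisible by num_kv_heads")
    head_dim = hidden_size // num_heads
    return {
        "head_dim": head_dim,
        # Number of query heads that share one key/value head.
        "queries_per_kv_head": num_heads // num_kv_heads,
        "q_proj_width": num_heads * head_dim,
        "kv_proj_width": num_kv_heads * head_dim,
    }


# Values from the Model Details list.
shapes = gqa_shapes(hidden_size=192, num_heads=4, num_kv_heads=2)
print(shapes)

# Tied input/output embeddings are counted once: vocabulary size x hidden size.
tied_embedding_params = 4096 * 192
print(tied_embedding_params)
```

With these values each key/value head serves two query heads, and the tied token embedding accounts for 786,432 of the 2,738,880 parameters.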

Tokenizer

Use this model with trust_remote_code=True. The submission includes an AtomTokenizer remote-code wrapper in tokenization_atom.py so standard Hugging Face callers can use AutoTokenizer.from_pretrained(...).

The tokenizer keeps byte-level BPE for ordinary text, but treats arithmetic-sensitive spans specially:

- digits 0-9 are atomic and never BPE-merged
- digit spans are emitted least-significant-digit first
- the symbols + - * / = ( ) are isolated atomic tokens
- whitespace is isolated from text
- arithmetic feature IDs are derived by the model from token IDs at inference time
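
The splitting rules above can be illustrated with a small pre-tokenization sketch. This is not the AtomTokenizer implementation: the function name pretokenize and the regular expression are hypothetical, whitespace is kept as whole runs (an assumption), and ordinary text spans are returned intact where the real tokenizer would go on to apply byte-level BPE.

```python
import re

# One alternative per span type: digit run, single arithmetic symbol,
# whitespace run, or any other run of text.
_SPAN = re.compile(
    r"(?P<digits>[0-9]+)"
    r"|(?P<symbol>[+\-*/=()])"
    r"|(?P<space>\s+)"
    r"|(?P<text>[^0-9+\-*/=()\s]+)"
)


def pretokenize(text: str) -> list[str]:
    """Split text into pieces, emitting each digit run least-significant-digit first."""
    pieces: list[str] = []
    for match in _SPAN.finditer(text):
        span = match.group()
        if match.lastgroup == "digits":
            # Each digit becomes its own piece, in reversed order.
            pieces.extend(reversed(span))
        else:
            pieces.append(span)
    return pieces


print(pretokenize("12 + 34 ="))
# ['2', '1', ' ', '+', ' ', '4', '3', ' ', '=']
```

Emitting the ones digit first means the model produces the digits of a sum in the same order that carries propagate.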

Training and custom tooling may still pass aligned place_ids and role_ids, but generic inference and evaluation only need input_ids and attention_mask.
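
One way place features could be recovered from tokens alone is sketched below. Because digit spans are least-significant-digit first, a digit's offset within its run is its decimal place. The ID scheme here is purely an assumption for illustration (0 for non-digits, offset + 1 for digits, capped at 65 so that IDs fit a table of 66 entries); the actual mapping used by the model is not described in this card, and the function works on token strings rather than token IDs for readability.

```python
DIGITS = "0123456789"


def derive_place_ids(pieces: list[str], max_place: int = 65) -> list[int]:
    """Assign a hypothetical place ID to each piece.

    Non-digit pieces get 0. Digits get 1 for the ones place, 2 for the tens
    place, and so on, assuming the digits arrive least-significant first.
    """
    place_ids: list[int] = []
    run_position = 0
    for piece in pieces:
        if len(piece) == 1 and piece in DIGITS:
            run_position += 1
            place_ids.append(min(run_position, max_place))
        else:
            # Any non-digit piece ends the current digit run.
            run_position = 0
            place_ids.append(0)
    return place_ids


# "12 + 34 =" after least-significant-digit-first splitting.
pieces = ["2", "1", " ", "+", " ", "4", "3", " ", "="]
print(derive_place_ids(pieces))
# [1, 2, 0, 0, 0, 1, 2, 0, 0]
```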

Usage

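import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Local checkpoint directory; "." assumes the script runs from the model folder.
model_dir = "."

# trust_remote_code=True is needed to load the custom model and tokenizer code.
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    trust_remote_code=True,
).eval()
tokenizer = AutoTokenizer.from_pretrained(
    model_dir,
    trust_remote_code=True,
)

text = "12 + 34 ="
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)

# Forward pass without gradient tracking.
with torch.no_grad():
    outputs = model(**inputs)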

Evaluation

ArithMark 2.0

Use the included benchmark script:

python benchmark_fusion_arithmark.py \
  --checkpoint . \
  --data-path arithmark_2.0.jsonl \
  --batch-size 64 \
  --device cuda \
  --output benchmark_results/fusion_arithmark_2.0_results.json

lm-evaluation-harness

For lm-evaluation-harness tasks, use the standard hf model with remote code enabled:

lm_eval \
  --model hf \
  --model_args pretrained=.,trust_remote_code=True,dtype=bfloat16,max_length=548 \
  --tasks hellaswag,arc_easy,arc_challenge,piqa \
  --device cuda:0 \
  --batch_size auto:1 \
  --output_path benchmark_results/lm_eval

max_length=548 is passed to the lm-evaluation-harness wrapper so that long multiple-choice continuations do not trip the harness assertion that a continuation must fit inside the model window. The tokenizer also advertises model_max_length=548, which matches the longest sequence observed in this eval run. The checkpoint was trained with a 512-token context, but the RoPE implementation can still score this slightly longer harness window. If a task variant contains longer continuations, reduce the batch size or set max_length to the longest sequence found.
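
A rough way to find the longest sequence for a new task variant is sketched below. The helper longest_request is hypothetical and only approximates what the harness does, since the harness may tokenize context and continuation jointly; treat the result as an estimate and leave some margin.

```python
from typing import Callable, Iterable


def longest_request(
    pairs: Iterable[tuple[str, str]],
    encode: Callable[[str], list],
) -> int:
    """Return the largest context + continuation token count over all pairs."""
    return max(
        (len(encode(context)) + len(encode(continuation)) for context, continuation in pairs),
        default=0,
    )


# Stand-in whitespace tokenizer so the example runs without the checkpoint.
examples = [("a b c", "d e"), ("a", "b")]
print(longest_request(examples, str.split))
# 5
```

With the real tokenizer, encode could be lambda s: tokenizer(s, add_special_tokens=False)["input_ids"], using the tokenizer object loaded in the Usage section.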

Results

Benchmark	Metric	Value
ArithMark 2.0	acc	0.6924
arc_challenge	acc_norm	0.2099
arc_easy	acc_norm	0.3161
hellaswag	acc_norm	0.2701
piqa	acc_norm	0.5299
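
For a sense of scale on the ArithMark number, the sketch below converts the accuracy into a count of correct examples and a binomial standard error. It assumes the 2,500 examples are scored independently, which is a simplification, and the function name accuracy_summary is hypothetical.

```python
import math


def accuracy_summary(accuracy: float, n_examples: int) -> tuple[int, float]:
    """Return (number correct, binomial standard error of the accuracy)."""
    correct = round(accuracy * n_examples)
    stderr = math.sqrt(accuracy * (1.0 - accuracy) / n_examples)
    return correct, stderr


# ArithMark 2.0: accuracy 0.6924 on 2,500 examples.
correct, stderr = accuracy_summary(0.6924, 2500)
print(correct)           # 1731
print(round(stderr, 4))  # 0.0092
```

A standard error of about 0.9 percentage points means the 3.1-point gap to the 66.12% baseline is a little over three standard errors under this simple model.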

Training Data

The pretraining mixture targeted about 3.5B tokens:

Ultra-FineWeb: 900M
FineWeb-Edu: 900M
FineMath: 450M
Cosmopedia-v2: 337.5M
UltraData-Math-L2-preview: 337.5M
Ultra-FineWeb-L3-en-QA-Synthetic: 225M
Synthetic-Arithmetic: 350M

Synthetic-Arithmetic is canonical integer equation data. The training curriculum is included as pretraining_curriculum.json.
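
As a quick consistency check, the per-source targets listed above can be summed and converted to shares. The dictionary simply restates the list in millions of tokens; the variable names are illustrative.

```python
# Target token counts in millions, copied from the mixture list.
MIXTURE_M_TOKENS = {
    "Ultra-FineWeb": 900.0,
    "FineWeb-Edu": 900.0,
    "FineMath": 450.0,
    "Cosmopedia-v2": 337.5,
    "UltraData-Math-L2-preview": 337.5,
    "Ultra-FineWeb-L3-en-QA-Synthetic": 225.0,
    "Synthetic-Arithmetic": 350.0,
}

total_m_tokens = sum(MIXTURE_M_TOKENS.values())
shares = {name: count / total_m_tokens for name, count in MIXTURE_M_TOKENS.items()}

print(f"total: {total_m_tokens / 1000:.1f}B tokens")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The targets sum to 3.5B tokens, and Synthetic-Arithmetic makes up 10% of the mixture.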

Limitations

This is a very small model and should be treated as an experimental research artifact.
Use trust_remote_code=True so AutoTokenizer applies the digit-span transform.
Numeric text is represented least-significant-digit first internally.
Role annotations intentionally target strict integer equations, not broad math prose, decimals, rationals, or QA formats.

Files

model.safetensors: model weights
config.json, config.py, configuration_gpt.py, model.py: model configuration and custom model code
tokenizer.json, tokenization_atom.py: tokenizer files and remote-code wrapper
benchmark_fusion_arithmark.py: ArithMark evaluation
arithmark_2.0.jsonl: local ArithMark 2.0 data for the standalone benchmark script
pretraining_curriculum.json: training curriculum