Atom2.7m

Atom2.7m is a small decoder-only causal language model trained with a general byte-level BPE tokenizer plus arithmetic-specific digit features. The model has 2,738,880 parameters and uses custom code for both the model and the tokenizer path.

The main result is on ArithMark 2.0, a 2,500-example integer-arithmetic continuation benchmark, where Atom2.7m scores 69.24% accuracy. That is above the published scores of SmolLM2-1.7B (66.12%) and Qwen2.5-0.5B (63.04%), with only 2.74M parameters.

The result suggests how much leverage domain-specific design can provide. With arithmetic-aware tokenization and digit features, Atom2.7m reaches the same ArithMark score band as models roughly 180 to 620 times larger by nominal parameter count.
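As a quick sanity check on the comparison above, the following sketch converts the quoted accuracies into example counts and size ratios. It assumes accuracy is computed over all 2,500 examples, and it takes the 1.7B and 0.5B sizes from the model names rather than from exact parameter counts.

```python
# Figures quoted in the text above. The comparison-model sizes are nominal
# (read off the model names), which is an assumption of this sketch.
N_EXAMPLES = 2_500
ATOM_PARAMS = 2_738_880

models = {
    "Atom2.7m": (0.6924, ATOM_PARAMS),
    "SmolLM2-1.7B": (0.6612, 1_700_000_000),
    "Qwen2.5-0.5B": (0.6304, 500_000_000),
}


def summarize(models, n_examples, base_params):
    """Map each model to (correct examples, size relative to the base model)."""
    rows = {}
    for name, (accuracy, params) in models.items():
        correct = round(accuracy * n_examples)
        rows[name] = (correct, params / base_params)
    return rows


rows = summarize(models, N_EXAMPLES, ATOM_PARAMS)
for name, (correct, ratio) in rows.items():
    print(f"{name:>14}: {correct}/{N_EXAMPLES} correct, {ratio:6.1f}x the size of Atom2.7m")
```

Under these assumptions the gap to SmolLM2-1.7B is 78 examples out of 2,500, so the ranking is worth rechecking against per-example outputs before drawing strong conclusions.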

Model Details

  • Architecture: decoder-only GPT
  • Parameters: 2,738,880
  • Layers: 5
  • Hidden size: 192
  • Attention heads: 4
  • KV heads: 2
  • Attention: grouped-query causal self-attention with RoPE and XSA projection
  • Context length: 512
  • Vocabulary size: 4,096
  • Token embeddings: tied input/output embeddings
  • Arithmetic feature embeddings:
    • place_vocab_size: 66
    • role_vocab_size: 12
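The attention shapes implied by the list above can be worked out directly. This is a sketch, not the model's actual code: it assumes bias-free linear projections and the usual grouped-query layout in which consecutive query heads share a KV head, and it leaves out the XSA projection because its shape is not given here.

```python
# Values from the Model Details list above.
hidden_size, n_heads, n_kv_heads, vocab_size = 192, 4, 2, 4096

head_dim = hidden_size // n_heads      # width of one attention head
group_size = n_heads // n_kv_heads     # query heads sharing one KV head
kv_width = n_kv_heads * head_dim       # output width of the K and V projections

# KV head read by each query head under the usual GQA grouping (assumption).
kv_head_for_query = [h // group_size for h in range(n_heads)]

# Per-layer weight counts, ASSUMING bias-free projections and ignoring the
# XSA projection, whose shape is not stated in the card.
attn_weights = {
    "q": hidden_size * hidden_size,
    "k": hidden_size * kv_width,
    "v": hidden_size * kv_width,
    "o": hidden_size * hidden_size,
}

# Input and output embeddings are tied, so this matrix is counted once.
tied_embedding_weights = vocab_size * hidden_size

print(head_dim, group_size, kv_width, kv_head_for_query)
print(attn_weights, sum(attn_weights.values()))
print(tied_embedding_weights)
```

With these numbers the tied embedding matrix alone accounts for roughly 29% of the 2,738,880 parameters, which is why the small 4,096-entry vocabulary matters at this scale.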

Tokenizer

Use this model with trust_remote_code=True. The submission includes an AtomTokenizer remote-code wrapper in tokenization_atom.py so standard Hugging Face callers can use AutoTokenizer.from_pretrained(...).

The tokenizer keeps byte-level BPE for ordinary text, but treats arithmetic-sensitive spans specially:

  • digits 0-9 are atomic and never BPE-merged
  • digit spans are emitted least-significant-digit first
  • + - * / = ( ) are isolated atomic tokens
  • whitespace is isolated from text
  • arithmetic feature IDs are derived by the model from token IDs at inference time

Training and custom tooling may still pass aligned place_ids and role_ids, but generic inference and evaluation only need input_ids and attention_mask.
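The segmentation rules above can be illustrated with a short sketch. This is not the AtomTokenizer implementation: the function names presegment and restore_digit_order are hypothetical, whitespace is assumed to be isolated as whole runs, and the remaining text pieces are assumed to be what byte-level BPE would then process.

```python
import re

# One alternative per span type: ASCII digit runs, single arithmetic symbols,
# whitespace runs, and everything else (ordinary text left for BPE).
_SPAN = re.compile(
    r"(?P<digits>[0-9]+)"
    r"|(?P<symbol>[+\-*/=()])"
    r"|(?P<space>\s+)"
    r"|(?P<text>[^0-9+\-*/=()\s]+)"
)


def presegment(text):
    """Split text into pieces, emitting each digit span one digit at a time,
    least-significant digit first."""
    pieces = []
    for match in _SPAN.finditer(text):
        span = match.group()
        if match.lastgroup == "digits":
            pieces.extend(reversed(span))
        else:
            pieces.append(span)
    return pieces


def restore_digit_order(pieces):
    """Inverse of presegment: flip each run of digit pieces back and join."""
    out, run = [], []
    for piece in pieces:
        if len(piece) == 1 and piece in "0123456789":
            run.append(piece)
        else:
            out.extend(reversed(run))
            run = []
            out.append(piece)
    out.extend(reversed(run))
    return "".join(out)


print(presegment("12 + 34 ="))
print(restore_digit_order(presegment("x=(105*7)")))
```

Emitting the least-significant digit first means that, in addition, a model generating left to right produces the units digit before any digit that depends on a carry from it.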

Usage

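import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Path to the checkpoint directory (here, the current directory).
model_dir = "."

# trust_remote_code=True loads the custom model and AtomTokenizer code
# shipped with the checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    trust_remote_code=True,
).eval()
tokenizer = AutoTokenizer.from_pretrained(
    model_dir,
    trust_remote_code=True,
)

# Generic inference needs only input_ids and attention_mask.
text = "12 + 34 ="
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)

# Forward pass without gradient tracking.
with torch.no_grad():
    outputs = model(**inputs)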

Evaluation

ArithMark 2.0

Use the included benchmark script:

python benchmark_fusion_arithmark.py \
  --checkpoint . \
  --data-path arithmark_2.0.jsonl \
  --batch-size 64 \
  --device cuda \
  --output benchmark_results/fusion_arithmark_2.0_results.json

lm-evaluation-harness

For lm-evaluation-harness tasks, use the standard hf model with remote code enabled:

lm_eval \
  --model hf \
  --model_args pretrained=.,trust_remote_code=True,dtype=bfloat16,max_length=548 \
  --tasks hellaswag,arc_easy,arc_challenge,piqa \
  --device cuda:0 \
  --batch_size auto:1 \
  --output_path benchmark_results/lm_eval

max_length=548 is passed to the lm-evaluation-harness wrapper so that long multiple-choice continuations do not trip the harness assertion that a continuation must fit inside the model window. The tokenizer also advertises model_max_length=548, which matches the longest sequence observed in this eval run. The checkpoint was trained with a 512-token context, but the RoPE implementation can still score this slightly longer harness window. If a task variant contains longer continuations, reduce the batch size or set max_length to the longest sequence found.
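One way to find the longest sequence for a new task variant is to measure context and continuation lengths before running the harness. The sketch below is illustrative only: longest_request is a hypothetical helper, the whitespace splitter stands in for a real tokenizer, and the harness may tokenize a context and continuation jointly or add special tokens, so the result is an estimate rather than an exact bound.

```python
def longest_request(pairs, encode):
    """Largest context-plus-continuation length, in tokens, over all pairs.

    `encode` is any callable that maps a string to a sequence of tokens.
    """
    return max(
        len(encode(context)) + len(encode(continuation))
        for context, continuation in pairs
    )


# Toy data and a whitespace "tokenizer" so the sketch runs on its own.
toy_pairs = [
    ("The sky is", " blue"),
    ("Two plus two equals", " four , as everyone knows"),
]
needed = longest_request(toy_pairs, str.split)
print(needed)
```

With the real tokenizer, encode could be lambda s: tokenizer(s, add_special_tokens=False)["input_ids"], and the returned value is a candidate for max_length.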

Results

Benchmark      Metric    Value
ArithMark 2.0  acc       0.6924
arc_challenge  acc_norm  0.2099
arc_easy       acc_norm  0.3161
hellaswag      acc_norm  0.2701
piqa           acc_norm  0.5299

Training Data

The pretraining mixture targeted about 3.5B tokens:

  • Ultra-FineWeb: 900M
  • FineWeb-Edu: 900M
  • FineMath: 450M
  • Cosmopedia-v2: 337.5M
  • UltraData-Math-L2-preview: 337.5M
  • Ultra-FineWeb-L3-en-QA-Synthetic: 225M
  • Synthetic-Arithmetic: 350M

Synthetic-Arithmetic is canonical integer equation data. The training curriculum is included as pretraining_curriculum.json.
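The mixture above can be checked and turned into shares with a few lines. The token counts are taken from the list. The grouping of three sources as math-focused is an assumption of this sketch, not something the curriculum file is known to define.

```python
# Target token counts from the list above, in millions of tokens.
mixture_m_tokens = {
    "Ultra-FineWeb": 900,
    "FineWeb-Edu": 900,
    "FineMath": 450,
    "Cosmopedia-v2": 337.5,
    "UltraData-Math-L2-preview": 337.5,
    "Ultra-FineWeb-L3-en-QA-Synthetic": 225,
    "Synthetic-Arithmetic": 350,
}

total_m = sum(mixture_m_tokens.values())
shares = {name: tokens / total_m for name, tokens in mixture_m_tokens.items()}

# Assumed grouping, for illustration only.
math_sources = ("FineMath", "UltraData-Math-L2-preview", "Synthetic-Arithmetic")
math_share = sum(shares[name] for name in math_sources)

print(f"total: {total_m / 1000:.1f}B tokens")
for name, share in shares.items():
    print(f"{name:>34}: {share:6.1%}")
print(f"math-focused share: {math_share:.1%}")
```

The listed sources sum to exactly 3.5B tokens, with Synthetic-Arithmetic making up 10% of the target.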

Limitations

  • This is a very small model and should be treated as an experimental research artifact.
  • Use trust_remote_code=True so AutoTokenizer applies the digit-span transform.
  • Numeric text is represented least-significant-digit first internally.
  • Role annotations intentionally target strict integer equations, not broad math prose, decimals, rationals, or QA formats.

Files

  • model.safetensors: model weights
  • config.json, config.py, configuration_gpt.py, model.py: custom model code
  • tokenizer.json, tokenization_atom.py: tokenizer files and remote-code wrapper
  • benchmark_fusion_arithmark.py: ArithMark evaluation
  • arithmark_2.0.jsonl: local ArithMark 2.0 data for the standalone benchmark script
  • pretraining_curriculum.json: training curriculum
