Dhara-250M — Tri-Mode (AR + Block-Diffusion + Self-Speculation)

A 250M-parameter language model that decodes in three modes from one set of weights, following NVIDIA's Nemotron-Labs-Diffusion: Tri-Mode recipe (joint AR + block-diffusion training). Built from codelion/dhara-250m-ar-base and trained to ~60B cumulative tokens (~50B added for this model). Architecture: LLaMA-style with Canon depthwise-conv layers, QK-norm, logit soft-cap, GQA, RoPE θ=8M.

Demo: dhara-chat Space — chat with it on CPU and compare all three decoding modes.

The three modes

| Mode | How | Use it for |
|---|---|---|
| AR | causal mask, KV-cached generate() | highest-quality left-to-right generation |
| Block-diffusion | block-causal mask, parallel unmasking | lower-latency parallel decoding (quality tradeoff) |
| Self-speculation | diffusion drafts → AR verifies | AR-quality output at lower latency (lossless-ish) |
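
To make the difference between the causal and block-causal masks concrete, here is a minimal pure-Python sketch. It assumes the usual block-diffusion convention (full attention inside a block, causal attention across blocks); whether Dhara's mask matches this exactly is an assumption, and the helper names are illustrative only.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]


def block_causal_mask(n, block_len):
    """Assumed convention: a position sees its whole block and all earlier blocks."""
    return [[(j // block_len) <= (i // block_len) for j in range(n)]
            for i in range(n)]


def show(mask):
    """Render a mask as rows of 'x' (visible) and '.' (hidden)."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


print(show(causal_mask(6)))
print()
print(show(block_causal_mask(6, block_len=3)))
```

In the second printout each group of three rows is identical, which is what lets all positions of a block be unmasked in parallel.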

Usage (transformers — works directly, no extra files)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ck = "codelion/dhara-250m"
tok = AutoTokenizer.from_pretrained(ck, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(ck, trust_remote_code=True, torch_dtype=torch.bfloat16).cuda().eval()
im_end = tok.convert_tokens_to_ids("<|im_end|>")

msgs = [{"role": "user", "content": "Give me three tips for staying healthy."}]
prompt = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
ids = tok(prompt, return_tensors="pt", add_special_tokens=False).input_ids.cuda()

# Mode 1: AR (recommended for chat); sampling gives the best quality
out = model.generate(ids, max_new_tokens=128, do_sample=True, temperature=0.7,
                     top_p=0.9, repetition_penalty=1.15, eos_token_id=im_end)
print(tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True))

# Mode 2: block-diffusion (faster, quality tradeoff)
out = model.generate_diffusion(ids, block_len=32, threshold=0.5, max_new_tokens=128)
print(tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True))

# Mode 3: self-speculation (AR-quality, speedup)
out = model.generate_self_spec(ids, k=8, max_new_tokens=128)
print(tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True))

Chat template is ChatML + Hermes-style tools (shipped in the tokenizer); the model supports an OpenAI-style tools=[...] argument.
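
A tool call might be set up roughly as follows. This is a sketch, not a tested recipe for this checkpoint: `get_weather` is a hypothetical tool, `build_tool_prompt` is an illustrative helper, and it assumes the shipped chat template accepts the standard `tools` argument of `apply_chat_template`.

```python
import json

# Hypothetical tool, written in the OpenAI function-calling schema.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}]


def build_tool_prompt(tok, messages, tools):
    """Render a chat prompt that advertises `tools` to the model.

    `tok` is a tokenizer loaded with AutoTokenizer.from_pretrained(...).
    """
    return tok.apply_chat_template(
        messages, tools=tools, tokenize=False, add_generation_prompt=True
    )


# Inspect the schema that would be handed to the template.
print(json.dumps(tools[0]["function"]["parameters"], indent=2))
```

The rendered prompt can then be tokenized and passed to any of the three generation calls in the usage snippet.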

Benchmarks (lm-eval-harness 0.4.11, identical harness for all models)

10 tasks (9 zero-shot + MMLU 5-shot); metric = acc_norm where defined, else acc. Columns: dhara-base (the AR base, codelion/dhara-250m-ar-base), this model in AR mode and in diffusion mode (dhara-diff), and — as an external reference run through the same harness — SmolLM-135M.

| Task | dhara-base (AR) | dhara (AR) | dhara-diff | SmolLM-135M |
|---|---|---|---|---|
| piqa | 57.7 | 62.6 | 58.3 | 68.3 |
| winogrande | 50.1 | 50.0 | 51.9 | 53.1 |
| truthfulqa_mc2 | 50.1 | 46.4 | 43.3 | 39.3 |
| boolq | 37.8 | 37.8 | 51.2 | 59.9 |
| openbookqa | 32.2 | 32.4 | 31.2 | 33.8 |
| arc_easy | 30.2 | 32.4 | 40.7 | 56.2 |
| hellaswag | 27.2 | 33.5 | 38.3 | 42.7 |
| arc_challenge | 25.6 | 27.5 | 24.7 | 28.8 |
| mmlu (5-shot) | 22.9 | 22.9 | 25.8 | 25.9 |
| sciq | 21.3 | 23.3 | 61.3 | 74.7 |
| Average | 35.5 | 36.9 | 42.7 | 48.3 |

Tri-mode training improves the AR base by +1.4 points (AR mode) and by +7.2 points in diffusion mode. dhara-diff (42.7) is the headline configuration — bidirectional answer scoring drives large gains over the base on sciq (+40), boolq (+13) and arc_easy (+11).

Data efficiency. SmolLM-135M (48.3) was trained on ~600B tokens — roughly 10× dhara's ~60B (built on the 10B-token pedagogical sutra-10B corpus). Despite that 10× data gap, dhara-diff lands only ~12% below SmolLM-135M on average (42.7 vs 48.3) and wins truthfulqa outright — echoing the data-efficiency results in Scaling Pedagogical Pre-training to 10 Billion Tokens.
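
The averages and gaps quoted above can be reproduced from the table with a few lines of arithmetic. The scores are copied from the table; the variable names are illustrative.

```python
# Per-task scores: (dhara-base, dhara AR, dhara-diff, SmolLM-135M)
scores = {
    "piqa":           (57.7, 62.6, 58.3, 68.3),
    "winogrande":     (50.1, 50.0, 51.9, 53.1),
    "truthfulqa_mc2": (50.1, 46.4, 43.3, 39.3),
    "boolq":          (37.8, 37.8, 51.2, 59.9),
    "openbookqa":     (32.2, 32.4, 31.2, 33.8),
    "arc_easy":       (30.2, 32.4, 40.7, 56.2),
    "hellaswag":      (27.2, 33.5, 38.3, 42.7),
    "arc_challenge":  (25.6, 27.5, 24.7, 28.8),
    "mmlu":           (22.9, 22.9, 25.8, 25.9),
    "sciq":           (21.3, 23.3, 61.3, 74.7),
}
columns = ("dhara-base", "dhara-ar", "dhara-diff", "smollm-135m")

# Unweighted mean over the 10 tasks, rounded as in the table.
avg = {
    name: round(sum(row[i] for row in scores.values()) / len(scores), 1)
    for i, name in enumerate(columns)
}

ar_gain = round(avg["dhara-ar"] - avg["dhara-base"], 1)
diff_gain = round(avg["dhara-diff"] - avg["dhara-base"], 1)
gap_pct = round(100 * (avg["smollm-135m"] - avg["dhara-diff"]) / avg["smollm-135m"], 1)

print(avg)
print(f"AR gain: +{ar_gain}, diffusion gain: +{diff_gain}, gap to SmolLM: {gap_pct}%")
```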

Decoding speed (single H100, measured)

| Mode | batch-1 speed | peak batched throughput | quality |
|---|---|---|---|
| AR (KV-cached) | 61 tok/s | 33,067 tok/s (batch 4096) | full |
| Block-diffusion (thr 0.5) | 103 tok/s | ~2,200 tok/s (OOM ≥ batch 2048) | quality tradeoff |
| Self-speculation (k=8) | 84 tok/s | ~2,200 tok/s | AR-quality (accept ~1.4/8) |
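
The ratios implied by this table can be computed directly. The numbers are copied from the table (the ~2,200 tok/s figures are approximate in the source); the variable names are illustrative.

```python
# Measured speeds from the table, in tokens per second.
batch1 = {"ar": 61, "block_diffusion": 103, "self_spec": 84}
peak = {"ar": 33067, "block_diffusion": 2200, "self_spec": 2200}

# Single-stream speedup of each mode relative to AR.
speedup = {mode: round(v / batch1["ar"], 2) for mode, v in batch1.items()}

# Batched throughput advantage of AR over the diffusion-based modes.
ar_batched_advantage = round(peak["ar"] / peak["block_diffusion"], 1)

print(speedup)
print(ar_batched_advantage)
```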

Two regimes. At batch 1, block-diffusion and self-speculation are 1.7× and 1.4× faster than AR, respectively: they emit 2.09 and 1.20 tokens per model forward, which is a single-stream latency win. Batched for throughput, AR wins by ~15×: it is KV-cached (one new token per forward, scaling from 61 tok/s at batch 1 to 33,067 tok/s at batch 4096), whereas the diffusion modes re-run a full uncached forward over the whole block at each step and saturate memory early. Rule of thumb: use diffusion or self-speculation for low-latency single-stream decoding, and batched AR for maximum throughput.
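
The draft-then-verify idea behind self-speculation can be sketched with toy stand-ins. This is a generic greedy speculative-decoding loop, not the model's `generate_self_spec` implementation: `ar_next` and `draft_block` are hypothetical placeholders for the AR head and the diffusion drafter, and the drafter's error pattern is invented for illustration.

```python
def ar_next(seq):
    """Toy stand-in for the AR head: a deterministic next token."""
    return (seq[-1] * 3 + 1) % 7


def draft_block(seq, k):
    """Toy stand-in for the drafter: agrees with AR except at every 3rd position."""
    out, ctx = [], list(seq)
    for _ in range(k):
        tok = ar_next(ctx)
        if len(ctx) % 3 == 0:
            tok = (tok + 1) % 7  # inject a drafting error
        out.append(tok)
        ctx.append(tok)
    return out


def self_speculate(prompt, k, max_new_tokens):
    """Draft k tokens, keep the prefix AR agrees with, then take AR's correction."""
    seq, forwards = list(prompt), 0
    while len(seq) - len(prompt) < max_new_tokens:
        draft = draft_block(seq, k)
        forwards += 2  # one drafting pass plus one AR verification pass
        for tok in draft:
            target = ar_next(seq)
            seq.append(target)  # AR's token is always the one that is kept
            if tok != target:
                break  # first mismatch: discard the rest of the draft
    return seq[:len(prompt) + max_new_tokens], forwards


def greedy_ar(prompt, max_new_tokens):
    """Plain one-token-per-forward greedy decoding, for comparison."""
    seq = list(prompt)
    for _ in range(max_new_tokens):
        seq.append(ar_next(seq))
    return seq


out, forwards = self_speculate([1, 2], k=4, max_new_tokens=12)
print(out == greedy_ar([1, 2], 12), forwards)
```

Because only tokens the AR head agrees with are kept, the output matches plain greedy AR decoding. The speedup comes from accepting more than one token per verification pass.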

Context length

4k tokens. The config permits 32768 positions (RoPE θ=8M) and the architecture includes YaRN, but the model was only trained to 4k. Perplexity is flat up to ~4k, degrades mildly at 8k, and degrades sharply beyond that (16k–32k).
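
One way to stay inside the trained window is to cap `max_new_tokens` by the room the prompt leaves. This is a minimal sketch: it reads "4k" as 4096 tokens, which is an assumption, and `clamp_new_tokens` is an illustrative helper, not part of the model's API.

```python
TRAINED_CONTEXT = 4096  # "4k tokens" from the text, taken here as 4096 (assumption)


def clamp_new_tokens(prompt_len, requested_new_tokens, context=TRAINED_CONTEXT):
    """Cap the generation length so prompt + output stays within the trained context."""
    remaining = context - prompt_len
    if remaining <= 0:
        raise ValueError("prompt already fills the trained context window")
    return min(requested_new_tokens, remaining)


print(clamp_new_tokens(prompt_len=4000, requested_new_tokens=128))
```

With the usage snippet above, `prompt_len` would be `ids.shape[1]`.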

Training

codelion/dhara-250m-ar-base → +30B continued pretraining + 10B high-LR probe on the pedagogical sutra-10B corpus → +7B Stage-2 joint AR+block-diffusion (α=0.3, block 32) → +2B joint SFT (Tulu-3 + Hermes function-calling) → +1B annealing (FineWeb-Edu + chat) ≈ 60B cumulative tokens.
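
As a consistency check, the stage sizes listed above add up to the ~50B added tokens. The base model's share is not stated in this section; the 10B figure below is only what the ~60B cumulative total implies.

```python
# Stage sizes in billions of tokens, as listed in the pipeline above.
stages = [
    ("continued pretraining", 30),
    ("high-LR probe", 10),
    ("Stage-2 joint AR + block-diffusion", 7),
    ("joint SFT", 2),
    ("annealing", 1),
]

added = sum(tokens for _, tokens in stages)
cumulative = 60                   # "~60B cumulative tokens" from the text
implied_base = cumulative - added  # inferred, not stated in this section

print(added, implied_base)
```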

Serving

Please use Hugging Face transformers for serving (from_pretrained(trust_remote_code=True)) — all three modes work directly, with no extra files or setup.

Example

Recommended chat settings (AR mode): do_sample=True, temperature=0.7, top_p=0.9, repetition_penalty=1.15.

Prompt: Give me three tips for staying healthy.

Output:

Firstly, make sure you're eating plenty of fruits and vegetables. These are good sources of vitamins and minerals that help support your immune system and overall health. Additionally, stay hydrated by drinking plenty of water throughout the day. This will help regulate your body's temperature and keep you hydrated.
