Dhara-250M — Tri-Mode (AR + Block-Diffusion + Self-Speculation)
A 250M-parameter language model that decodes in three modes from one set of weights, following the recipe of NVIDIA's Nemotron-Labs-Diffusion: a Tri-Mode Language Model (joint AR + block-diffusion training). It is built from codelion/dhara-250m-ar-base and trained to ~60B cumulative tokens (~50B added for this model). Architecture: LLaMA-style with Canon depthwise-conv layers, QK-norm, logit soft-cap, grouped-query attention (GQA), and RoPE with θ=8M.
Demo: dhara-chat Space — chat with it on CPU and compare all three decoding modes.
The three modes
| Mode | How | Use it for |
|---|---|---|
| AR | causal mask, KV-cached generate() | highest-quality left-to-right generation |
| Block-diffusion | block-causal mask, parallel unmasking | lower-latency parallel decoding (quality tradeoff) |
| Self-speculation | diffusion drafts → AR verifies | AR-quality output at lower latency (lossless-ish) |
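The two mask types named in the table can be illustrated with a small sketch. It assumes the common definition of block-causal attention (a position may attend to every position in its own block and in all earlier blocks); the helper names `causal_mask`, `block_causal_mask`, and `show` are hypothetical and are not taken from the model's code.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def block_causal_mask(n, block_len):
    """Assumed definition: i may attend to j when j's block is not after i's block."""
    return [[(j // block_len) <= (i // block_len) for j in range(n)]
            for i in range(n)]


def show(mask):
    """Render a mask as rows of 'x' (visible) and '.' (hidden)."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


print(show(causal_mask(6)))
print()
print(show(block_causal_mask(6, block_len=3)))
```

With `block_len=3`, positions 0 to 2 see each other in full, and positions 3 to 5 see the first block plus each other. That within-block visibility is what allows several tokens of a block to be unmasked in parallel.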
Usage (transformers — works directly, no extra files)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
ck = "codelion/dhara-250m"
tok = AutoTokenizer.from_pretrained(ck, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(ck, trust_remote_code=True, torch_dtype=torch.bfloat16).cuda().eval()
im_end = tok.convert_tokens_to_ids("<|im_end|>")
msgs = [{"role": "user", "content": "Give me three tips for staying healthy."}]
prompt = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
ids = tok(prompt, return_tensors="pt", add_special_tokens=False).input_ids.cuda()
# Mode 1 — AR (recommended for chat); sampling gives the best quality
out = model.generate(ids, max_new_tokens=128, do_sample=True, temperature=0.7,
                     top_p=0.9, repetition_penalty=1.15, eos_token_id=im_end)
print(tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True))
# Mode 2 — block-diffusion (faster, quality tradeoff)
print(tok.decode(model.generate_diffusion(ids, block_len=32, threshold=0.5, max_new_tokens=128)[0, ids.shape[1]:], skip_special_tokens=True))
# Mode 3 — self-speculation (AR-quality, speedup)
print(tok.decode(model.generate_self_spec(ids, k=8, max_new_tokens=128)[0, ids.shape[1]:], skip_special_tokens=True))
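The draft-then-verify idea behind Mode 3 can be sketched with a toy example. This is a generic greedy speculative-decoding step, not the model's `generate_self_spec` implementation; `accept_draft` and the toy `verifier_next` function are hypothetical names introduced for illustration.

```python
def accept_draft(prefix, draft, verifier_next):
    """One greedy draft-then-verify step.

    prefix:        tokens generated so far
    draft:         k tokens proposed by the drafter (the diffusion pass, in spirit)
    verifier_next: maps a token list to the verifier's greedy next token (AR, in spirit)

    Returns the tokens to append: the longest draft prefix the verifier agrees
    with, plus one verifier token (the correction at the first mismatch, or a
    bonus token if the whole draft was accepted).
    """
    accepted = []
    for tok in draft:
        expected = verifier_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first wrong draft token
            return accepted
        accepted.append(tok)
    accepted.append(verifier_next(prefix + accepted))  # whole draft accepted
    return accepted


# Toy verifier: the "correct" next token is always last + 1 (mod 10).
toy_verifier = lambda seq: (seq[-1] + 1) % 10
print(accept_draft([1, 2], [3, 4, 9, 9], toy_verifier))
```

Every emitted token is one the verifier would have produced greedily, so the output matches verifier-only greedy decoding. A real implementation would score all k draft positions in a single forward pass rather than calling the verifier once per token as this sketch does.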
The chat template is ChatML with Hermes-style tool calling (shipped in the tokenizer), so OpenAI-style tool definitions can be passed through a tools=[...] argument.
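A minimal sketch of passing a tool definition, assuming the tools are supplied through the `tools` parameter of the tokenizer's `apply_chat_template` method in transformers. The `get_weather` tool and the `build_tool_prompt` helper are hypothetical and exist only for illustration.

```python
# Hypothetical tool definition in the OpenAI function-calling schema.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}]

msgs = [{"role": "user", "content": "What is the weather in Paris?"}]


def build_tool_prompt(tok, msgs, tools):
    """Render the chat template with tool definitions.

    tok is a tokenizer loaded with AutoTokenizer.from_pretrained(...).
    """
    return tok.apply_chat_template(
        msgs, tools=tools, tokenize=False, add_generation_prompt=True
    )
```

The returned string can then be tokenized and passed to `model.generate` in the same way as the plain chat prompt in the usage snippet.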
Benchmarks (lm-eval-harness 0.4.11, identical harness for all models)
10 tasks (9 zero-shot + MMLU 5-shot); metric = acc_norm where defined, else acc. Columns: dhara-base (the AR base, codelion/dhara-250m-ar-base), this model in AR mode and in diffusion mode (dhara-diff), and — as an external reference run through the same harness — SmolLM-135M.
| Task | dhara-base (AR) | dhara (AR) | dhara-diff | SmolLM-135M |
|---|---|---|---|---|
| piqa | 57.7 | 62.6 | 58.3 | 68.3 |
| winogrande | 50.1 | 50.0 | 51.9 | 53.1 |
| truthfulqa_mc2 | 50.1 | 46.4 | 43.3 | 39.3 |
| boolq | 37.8 | 37.8 | 51.2 | 59.9 |
| openbookqa | 32.2 | 32.4 | 31.2 | 33.8 |
| arc_easy | 30.2 | 32.4 | 40.7 | 56.2 |
| hellaswag | 27.2 | 33.5 | 38.3 | 42.7 |
| arc_challenge | 25.6 | 27.5 | 24.7 | 28.8 |
| mmlu (5-shot) | 22.9 | 22.9 | 25.8 | 25.9 |
| sciq | 21.3 | 23.3 | 61.3 | 74.7 |
| Average | 35.5 | 36.9 | 42.7 | 48.3 |
Tri-mode training improves the AR base by +1.4 points (AR mode) and by +7.2 points in diffusion mode. dhara-diff (42.7) is the headline configuration — bidirectional answer scoring drives large gains over the base on sciq (+40), boolq (+13) and arc_easy (+11).
Data efficiency. SmolLM-135M (48.3) was trained on ~600B tokens — roughly 10× dhara's ~60B (built on the 10B-token pedagogical sutra-10B corpus). Despite that 10× data gap, dhara-diff lands only ~12% below SmolLM-135M on average (42.7 vs 48.3) and wins truthfulqa outright — echoing the data-efficiency results in Scaling Pedagogical Pre-training to 10 Billion Tokens.
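The averages and gaps quoted above can be recomputed from the table. The sketch below only copies the table's per-task scores (in the table's row order); the variable names are illustrative.

```python
# Per-task scores copied from the benchmark table (row order as listed).
scores = {
    "dhara-base":  [57.7, 50.1, 50.1, 37.8, 32.2, 30.2, 27.2, 25.6, 22.9, 21.3],
    "dhara-ar":    [62.6, 50.0, 46.4, 37.8, 32.4, 32.4, 33.5, 27.5, 22.9, 23.3],
    "dhara-diff":  [58.3, 51.9, 43.3, 51.2, 31.2, 40.7, 38.3, 24.7, 25.8, 61.3],
    "smollm-135m": [68.3, 53.1, 39.3, 59.9, 33.8, 56.2, 42.7, 28.8, 25.9, 74.7],
}

# Unweighted mean over the 10 tasks, rounded to one decimal as in the table.
avg = {name: round(sum(vals) / len(vals), 1) for name, vals in scores.items()}

# Gains of the tri-mode model over its AR base, in points.
gain_ar = round(avg["dhara-ar"] - avg["dhara-base"], 1)
gain_diff = round(avg["dhara-diff"] - avg["dhara-base"], 1)

# Relative gap between dhara-diff and SmolLM-135M, in percent.
rel_gap_pct = round(100 * (avg["smollm-135m"] - avg["dhara-diff"]) / avg["smollm-135m"], 1)

print(avg, gain_ar, gain_diff, rel_gap_pct)
```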
Decoding speed (single H100, measured)
| Mode | batch-1 decoding speed | peak batched throughput | quality |
|---|---|---|---|
| AR (KV-cached) | 61 tok/s | 33,067 tok/s (batch 4096) | full |
| Block-diffusion (thr 0.5) | 103 tok/s | ~2,200 tok/s (OOM ≥ batch 2048) | quality tradeoff |
| Self-speculation (k=8) | 84 tok/s | ~2,200 tok/s | AR-quality (accept ~1.4/8) |
Two regimes. At batch 1, block-diffusion and self-speculation are 1.4–1.7× faster than AR — they emit 2.09 / 1.20 tokens per model forward, a single-stream latency win. Batched for throughput, AR wins by ~15×: it is KV-cached (one new token per forward, scaling 61 → 33,067 tok/s from batch 1 → 4096), whereas the diffusion modes re-run a full uncached forward over the whole block each step and saturate memory early. Rule of thumb: reach for diffusion/self-speculation for low-latency single-stream decoding, and for batched AR when you want maximum throughput.
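The ratios quoted above follow directly from the table. A short check, using only the measured numbers from the table (variable names are illustrative):

```python
# Measured decoding speed in tokens per second, copied from the table.
batch1 = {"ar": 61, "block_diffusion": 103, "self_spec": 84}
peak = {"ar": 33067, "block_diffusion": 2200, "self_spec": 2200}

# Single-stream speedup of each mode relative to AR at batch 1.
speedup_b1 = {mode: round(v / batch1["ar"], 2) for mode, v in batch1.items()}

# Peak batched throughput advantage of AR over the diffusion-based modes.
ar_throughput_advantage = round(peak["ar"] / peak["block_diffusion"], 1)

print(speedup_b1, ar_throughput_advantage)
```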
Context length
4k tokens. The config permits 32768 positions (θ=8M) and the architecture includes YaRN, but the model was only trained to 4k: perplexity is flat up to ~4k, degrades mildly at 8k, and degrades sharply beyond that (16k–32k).
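One way to stay inside the trained window is to trim the prompt before generating. The sketch below is a hypothetical helper (`fit_to_context` is not part of the model's API) and assumes "4k" means 4096 tokens.

```python
TRAINED_CONTEXT = 4096  # assumption: "4k" is taken to mean 4096 tokens


def fit_to_context(token_ids, max_new_tokens, context_len=TRAINED_CONTEXT):
    """Drop the oldest tokens so that prompt + generation fits in context_len."""
    budget = context_len - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    if len(token_ids) > budget:
        return token_ids[-budget:]  # keep the most recent tokens
    return token_ids


trimmed = fit_to_context(list(range(5000)), max_new_tokens=128)
print(len(trimmed))
```

Trimming from the left can cut into the chat template's opening tokens, so for chat use it may be preferable to drop whole earlier turns before applying the template.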
Training
codelion/dhara-250m-ar-base → +30B continued pretraining + 10B high-LR probe on the pedagogical sutra-10B corpus → +7B Stage-2 joint AR+block-diffusion (α=0.3, block 32) → +2B joint SFT (Tulu-3 + Hermes function-calling) → +1B annealing (FineWeb-Edu + chat) ≈ 60B cumulative tokens.
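The token budget can be tallied from the stages listed above. The base model's share is not stated in this pipeline; the value below is only what the ~60B cumulative total implies.

```python
# Billions of tokens added per stage, copied from the pipeline above.
stages_b = {
    "continued pretraining": 30,
    "high-LR probe": 10,
    "stage-2 joint AR + block-diffusion": 7,
    "joint SFT": 2,
    "annealing": 1,
}

added_b = sum(stages_b.values())        # tokens added for this model
cumulative_b = 60                       # stated approximate cumulative total
implied_base_b = cumulative_b - added_b  # implied, not stated: base model budget

print(added_b, implied_base_b)
```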
Serving
Please use Hugging Face transformers for serving (from_pretrained(trust_remote_code=True)) — all three modes work directly, with no extra files or setup.
Example
Recommended chat settings (AR mode): do_sample=True, temperature=0.7, top_p=0.9, repetition_penalty=1.15.
Prompt: Give me three tips for staying healthy.
Output:
Firstly, make sure you're eating plenty of fruits and vegetables. These are good sources of vitamins and minerals that help support your immune system and overall health. Additionally, stay hydrated by drinking plenty of water throughout the day. This will help regulate your body's temperature and keep you hydrated.
References
- Recipe: Nemotron-Labs-Diffusion: a Tri-Mode Language Model (NVIDIA, 2026).
- Pedagogical pre-training: Scaling Pedagogical Pre-training to 10 Billion Tokens.
- Pre-training data: codelion/sutra-10B.