Dhara-250M — Tri-Mode (AR + Block-Diffusion + Self-Speculation)

A 250M-parameter language model that decodes in three modes from a single set of weights, following NVIDIA's Nemotron-Labs-Diffusion: Tri-Mode recipe (joint AR + block-diffusion training). It is built from codelion/dhara-250m-ar-base and trained to ~60B cumulative tokens, of which ~50B were added for this model. Architecture: LLaMA-style with Canon depthwise-convolution layers, QK-norm, logit soft-capping, grouped-query attention (GQA), and RoPE with θ=8M.

Demo: dhara-chat Space — chat with it and compare all three decoding modes.

The three modes

| Mode | How | Use it for |
|---|---|---|
| AR | causal mask, KV-cached `generate()` | highest-quality left-to-right generation |
| Block-diffusion | block-causal mask, parallel unmasking | lower-latency parallel decoding (quality tradeoff) |
| Self-speculation | diffusion drafts → AR verifies | AR-quality output at lower latency (lossless-ish) |
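
The difference between the two mask types in the table can be illustrated with a small sketch. It assumes the common block-diffusion convention, where a position attends to every position in its own block and in all earlier blocks; the model's actual mask construction is not shown in this card and may differ. The function names are hypothetical.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def block_causal_mask(n, block_len):
    """Assumed convention: full attention inside a block, causal across blocks."""
    return [
        [(j // block_len) <= (i // block_len) for j in range(n)]
        for i in range(n)
    ]


def show(mask):
    """Render a mask as text: 'x' = may attend, '.' = masked out."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


print(show(causal_mask(4)))
print()
print(show(block_causal_mask(4, block_len=2)))
```

With `block_len=1` the block-causal mask reduces to the ordinary causal mask, which is one way to see how a single set of weights can serve both modes.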

Usage (transformers — works directly, no extra files)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ck = "codelion/dhara-250m"
tok = AutoTokenizer.from_pretrained(ck, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    ck, trust_remote_code=True, torch_dtype=torch.bfloat16
).cuda().eval()
im_end = tok.convert_tokens_to_ids("<|im_end|>")  # end-of-turn token, used as EOS

msgs = [{"role": "user", "content": "Give me three tips for staying healthy."}]
prompt = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
ids = tok(prompt, return_tensors="pt", add_special_tokens=False).input_ids.cuda()
n_prompt = ids.shape[1]  # prompt length, sliced off before decoding

# Mode 1: AR (recommended for chat); sampling gives the best quality
out = model.generate(ids, max_new_tokens=128, do_sample=True, temperature=0.7,
                     top_p=0.9, repetition_penalty=1.15, eos_token_id=im_end)
print(tok.decode(out[0, n_prompt:], skip_special_tokens=True))

# Mode 2: block-diffusion (faster, quality tradeoff)
out = model.generate_diffusion(ids, block_len=32, threshold=0.5, max_new_tokens=128)
print(tok.decode(out[0, n_prompt:], skip_special_tokens=True))

# Mode 3: self-speculation (AR-quality, speedup)
out = model.generate_self_spec(ids, k=8, max_new_tokens=128)
print(tok.decode(out[0, n_prompt:], skip_special_tokens=True))
```
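
To make the draft-then-verify idea behind self-speculation concrete, here is a toy sketch of a greedy accept loop. It is not the model's `generate_self_spec` implementation: `draft_fn` and `ar_next_fn` are hypothetical stand-ins for the diffusion drafter and the AR verifier, and a real implementation would presumably score all `k` draft tokens in one forward pass rather than one call per token.

```python
def self_speculate(prefix, draft_fn, ar_next_fn, k, max_new_tokens):
    """Greedy draft-then-verify loop (illustrative only).

    draft_fn(tokens, k) -> list of k proposed tokens (stand-in for the drafter)
    ar_next_fn(tokens)  -> the token greedy AR decoding would emit next
    Returns (new_tokens, number_of_draft_rounds).
    """
    out = list(prefix)
    target_len = len(prefix) + max_new_tokens
    rounds = 0
    while len(out) < target_len:
        rounds += 1
        for proposed in draft_fn(out, k):
            expected = ar_next_fn(out)
            # Always emit the AR token, so the result equals plain greedy AR.
            out.append(expected)
            if proposed != expected or len(out) >= target_len:
                break  # first mismatch (or length limit) ends this round
    return out[len(prefix):], rounds


# Toy models: AR counts upward mod 10; the drafter gets only 2 tokens right.
ar_next = lambda toks: (toks[-1] + 1) % 10
draft = lambda toks, k: [(toks[-1] + 1) % 10, (toks[-1] + 2) % 10] + [-1] * (k - 2)

new_tokens, rounds = self_speculate([0], draft, ar_next, k=4, max_new_tokens=6)
print(new_tokens, rounds)
```

Because every emitted token is the one the AR verifier would have chosen, the output matches greedy AR decoding regardless of draft quality; better drafts only reduce the number of rounds.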

The chat template, shipped with the tokenizer, is ChatML with Hermes-style tool calling; an OpenAI-style `tools=[...]` argument is supported.
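
For readers unfamiliar with ChatML, the sketch below hand-renders a plain ChatML prompt and shows what an OpenAI-style tool definition looks like. It is an illustration only: the template shipped with the tokenizer is authoritative and may add a system prompt or a tool section. The `<|im_start|>` marker is assumed from the standard ChatML convention (only `<|im_end|>` appears in this card), and `render_chatml` and `get_weather` are hypothetical names.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render messages in plain ChatML (no default system prompt, no tools)."""
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        text += "<|im_start|>assistant\n"  # cue the model to start its answer
    return text


# Hypothetical tool written in the OpenAI function-calling schema.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool name
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

msgs = [{"role": "user", "content": "Who are you?"}]
print(render_chatml(msgs))
```

In transformers, `apply_chat_template` accepts a `tools=` parameter, so a list like the one above would presumably be passed there; this card does not spell out the exact call.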

Benchmarks (lm-eval-harness 0.4.11, identical harness for all models)

10 tasks (9 zero-shot + MMLU 5-shot); metric = acc_norm where defined, else acc. Columns: dhara-base (the AR base, codelion/dhara-250m-ar-base), this model in AR mode and in diffusion mode (dhara-diff), and — as an external reference run through the same harness — SmolLM-135M.

| Task | dhara-base (AR) | dhara (AR) | dhara-diff | SmolLM-135M |
|---|---|---|---|---|
| piqa | 57.7 | 62.6 | 58.3 | 68.3 |
| winogrande | 50.1 | 50.0 | 51.9 | 53.1 |
| truthfulqa_mc2 | 50.1 | 46.4 | 43.3 | 39.3 |
| boolq | 37.8 | 37.8 | 51.2 | 59.9 |
| openbookqa | 32.2 | 32.4 | 31.2 | 33.8 |
| arc_easy | 30.2 | 32.4 | 40.7 | 56.2 |
| hellaswag | 27.2 | 33.5 | 38.3 | 42.7 |
| arc_challenge | 25.6 | 27.5 | 24.7 | 28.8 |
| mmlu (5-shot) | 22.9 | 22.9 | 25.8 | 25.9 |
| sciq | 21.3 | 23.3 | 61.3 | 74.7 |
| Average | 35.5 | 36.9 | 42.7 | 48.3 |

Tri-mode training improves the AR base by +1.4 points (AR mode) and by +7.2 points in diffusion mode. dhara-diff (42.7) is the headline configuration — bidirectional answer scoring drives large gains over the base on sciq (+40), boolq (+13) and arc_easy (+11).

Data efficiency. SmolLM-135M (48.3) was trained on ~600B tokens — roughly 10× dhara's ~60B (built on the 10B-token pedagogical sutra-10B corpus). Despite that 10× data gap, dhara-diff lands only ~12% below SmolLM-135M on average (42.7 vs 48.3) and wins truthfulqa outright — echoing the data-efficiency results in Scaling Pedagogical Pre-training to 10 Billion Tokens.
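
The averages and gaps quoted above can be re-derived from the per-task scores. The short script below copies the numbers from the benchmark table and recomputes the column means, the gains over the base, and the relative gap to SmolLM-135M; the variable names are only for illustration.

```python
# Per-task scores copied from the benchmark table, in table order
# (piqa, winogrande, truthfulqa_mc2, boolq, openbookqa, arc_easy,
#  hellaswag, arc_challenge, mmlu, sciq).
scores = {
    "dhara-base (AR)": [57.7, 50.1, 50.1, 37.8, 32.2, 30.2, 27.2, 25.6, 22.9, 21.3],
    "dhara (AR)":      [62.6, 50.0, 46.4, 37.8, 32.4, 32.4, 33.5, 27.5, 22.9, 23.3],
    "dhara-diff":      [58.3, 51.9, 43.3, 51.2, 31.2, 40.7, 38.3, 24.7, 25.8, 61.3],
    "SmolLM-135M":     [68.3, 53.1, 39.3, 59.9, 33.8, 56.2, 42.7, 28.8, 25.9, 74.7],
}

mean = {name: sum(vals) / len(vals) for name, vals in scores.items()}
avg = {name: round(m, 1) for name, m in mean.items()}

# Gains of each decoding mode over the AR base, in points.
gain_ar = round(mean["dhara (AR)"] - mean["dhara-base (AR)"], 1)
gain_diff = round(mean["dhara-diff"] - mean["dhara-base (AR)"], 1)

# Relative shortfall of dhara-diff versus SmolLM-135M, in percent.
gap_pct = round(100 * (mean["SmolLM-135M"] - mean["dhara-diff"]) / mean["SmolLM-135M"], 1)

print(avg)
print(gain_ar, gain_diff, gap_pct)
```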

Decoding speed (single H100, measured)

| Mode | batch-1 decoding speed | peak batched throughput | quality |
|---|---|---|---|
| AR (KV-cached) | 61 tok/s | 33,067 tok/s (batch 4096) | full |
| Block-diffusion (thr 0.5) | 103 tok/s | ~2,200 tok/s (OOM ≥ batch 2048) | quality tradeoff |
| Self-speculation (k=8) | 84 tok/s | ~2,200 tok/s | AR-quality (accept ~1.4/8) |

Two regimes. At batch 1, block-diffusion and self-speculation are about 1.7× and 1.4× faster than AR, respectively: they emit 2.09 and 1.20 tokens per model forward, which is a single-stream latency win. Batched for throughput, AR wins by ~15×: it is KV-cached (one new token per forward, scaling from 61 tok/s at batch 1 to 33,067 tok/s at batch 4096), whereas the diffusion modes re-run a full uncached forward over the whole block at each step and saturate memory early. Rule of thumb: use diffusion or self-speculation for low-latency single-stream decoding, and batched AR for maximum throughput.
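
The two ratios in this paragraph follow directly from the table. The snippet below recomputes them from the measured figures; the ~2,200 tok/s entries are approximate in the source, so the batched ratio is approximate as well.

```python
# Measured decoding speeds from the table above, in tokens per second.
batch1 = {"AR": 61, "block-diffusion": 103, "self-speculation": 84}
peak = {"AR": 33067, "block-diffusion": 2200, "self-speculation": 2200}

# Single-stream speedup of each mode relative to AR.
speedup = {mode: round(tps / batch1["AR"], 2) for mode, tps in batch1.items()}

# Batched throughput advantage of AR over the diffusion-based modes.
ar_advantage = round(peak["AR"] / peak["block-diffusion"], 1)

print(speedup)
print(ar_advantage)
```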

Context length

4k tokens. The config permits 32,768 positions (RoPE θ=8M) and the architecture includes YaRN, but the model was only trained at up to 4k tokens; perplexity is flat up to ~4k, degrades mildly at 8k, and degrades sharply beyond that (16k–32k).
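
One practical consequence is to keep the prompt plus generated tokens inside the trained window. The helper below is a minimal sketch of left-truncation, assuming "4k" means 4096 tokens; `clip_to_context` is a hypothetical name, and dropping the oldest tokens can cut into the start of a chat template, so a real application might trim whole turns instead.

```python
def clip_to_context(token_ids, max_new_tokens, context_len=4096):
    """Keep the most recent tokens so prompt + generation fits in context_len.

    context_len=4096 is an assumption for the card's "4k tokens".
    """
    budget = context_len - max_new_tokens  # tokens left for the prompt
    if budget <= 0:
        raise ValueError("max_new_tokens must be smaller than context_len")
    return token_ids[-budget:]


long_prompt = list(range(5000))
clipped = clip_to_context(long_prompt, max_new_tokens=128)
print(len(clipped), clipped[0], clipped[-1])
```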

Training

codelion/dhara-250m-ar-base → +30B continued pretraining + 10B high-LR probe on the pedagogical sutra-10B corpus → +7B Stage-2 joint AR+block-diffusion (α=0.3, block 32) → +2B joint SFT (Tulu-3 + Hermes function-calling) → +1B annealing (FineWeb-Edu + chat) ≈ 60B cumulative tokens.
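
As a quick check of the token budget, the stage sizes listed above can be tallied. The stage labels are shortened here, and the size of the base model's own pretraining is not stated on this line; it is inferred below as the remainder of the ~60B total, which is an assumption.

```python
# Tokens added on top of the AR base, in billions, as listed in the pipeline above.
stages = [
    ("continued pretraining", 30),
    ("high-LR probe", 10),
    ("Stage-2 joint AR + block-diffusion", 7),
    ("joint SFT", 2),
    ("annealing", 1),
]

added = sum(tokens for _, tokens in stages)
cumulative = 60                    # approximate total quoted in the card
implied_base = cumulative - added  # inferred, not stated explicitly

print(added, implied_base)
```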

Serving

Please use Hugging Face transformers for serving (from_pretrained(trust_remote_code=True)) — all three modes work directly, with no extra files or setup.

Example

Recommended chat settings (AR mode): do_sample=True, temperature=0.7, top_p=0.9, repetition_penalty=1.15.

Prompt: Give me three tips for staying healthy.

Output:

Firstly, make sure you're eating plenty of fruits and vegetables. These are good sources of vitamins and minerals that help support your immune system and overall health. Additionally, stay hydrated by drinking plenty of water throughout the day. This will help regulate your body's temperature and keep you hydrated.