Phi-tiny-MoE — Hexagon v79 + v81 NPU bundle (QHexRT)

microsoft/Phi-tiny-MoE-instruct (PhiMoEForCausalLM, 3.8 B total / 1.1 B active, 16 experts top-2) compiled to run on the Qualcomm Hexagon v79 NPU (Snapdragon 8 Elite / SM8750, e.g. Galaxy S25) through QHexRT, a thin C++ QNN runtime. A decode-only v81 (SM8850) build is also included. First MoE in the QHexRT family.

A native, manifest-driven QHexRT bundle: text in → text out via the standard qhx_generate tool, exactly like the other models. Every number below is a real on-device measurement on a Samsung S25.

How it runs

A phimoe_generate host-op drives an AR=1 KV-cache decode. Per layer: a GQA-native attention+router NPU graph (K/V kept at 4 heads, so no repeat_interleave and no VTCM spill) → host sparsemixer top-2 → ONE fused 2-expert FFN graph (ffn2), which takes both selected experts, dequantized from host-RAM int8 in parallel, plus the mixing weights m1, m2, and computes m1·SwiGLU_a + m2·SwiGLU_b in a single execute. Only the 2 active experts are read per token.

Quant: int8 experts + fp16 attention / router / lm-head (int16 activations break attention; W4 is too coarse). The decode path fits the device in 4 QNN contexts, and the batched-prefill graphs bring the full v79 bundle to 7 (the HTP caps concurrent contexts at ≈ 8).

The Phi-3 tokenizer + chat template (<|user|> … <|end|><|assistant|>) are applied on-device, so you pass plain text. MAXCTX = 2048: with 16 Q-heads the wide attention stays HTP-correct past the v79 512 wall (verified 12/12 vs fp32 on a 529-token needle-recall prompt).
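The per-token MoE step described above can be sketched in plain Python. This is an illustration under stated assumptions, not the QHexRT implementation: routing uses an ordinary softmax over the two winning logits as a stand-in for sparsemixer (which computes its weights differently), the weights are plain floats rather than dequantized int8, and every name here (`top2_route`, `swiglu_ffn`, `moe_token_step`, `make_toy_expert`) is hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a small list."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def top2_route(router_logits):
    """Pick the two highest-scoring experts and their mixing weights.
    ASSUMPTION: plain softmax over the two winning logits, used only as a
    stand-in for the model's sparsemixer routing."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    e1, e2 = order[0], order[1]
    m1, m2 = softmax([router_logits[e1], router_logits[e2]])
    return (e1, m1), (e2, m2)

def matvec(w, x):
    """w is a list of rows; returns w @ x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def silu(v):
    return v / (1.0 + math.exp(-v))

def swiglu_ffn(expert, x):
    """SwiGLU feed-forward: down( silu(gate x) * (up x) )."""
    gate = matvec(expert["gate"], x)
    up = matvec(expert["up"], x)
    hidden = [silu(g) * u for g, u in zip(gate, up)]
    return matvec(expert["down"], hidden)

def moe_token_step(x, router_logits, experts):
    """Touch only the two selected experts and blend their outputs:
    y = m1 * SwiGLU_a(x) + m2 * SwiGLU_b(x)."""
    (e1, m1), (e2, m2) = top2_route(router_logits)
    ya = swiglu_ffn(experts[e1], x)
    yb = swiglu_ffn(experts[e2], x)
    return [m1 * a + m2 * b for a, b in zip(ya, yb)], (e1, e2)

def make_toy_expert(scale):
    """Hypothetical 2x2 expert: identity gate/down, scaled identity up."""
    return {"gate": [[1.0, 0.0], [0.0, 1.0]],
            "up": [[scale, 0.0], [0.0, scale]],
            "down": [[1.0, 0.0], [0.0, 1.0]]}
```

In the bundle the two FFN evaluations and the weighted sum happen inside one NPU graph execute; the sketch only shows the arithmetic that the graph is described as performing.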

| Metric | Value |
|---|---|
| Decode | ~5–7 tok/s (top-2; best ~7 cold, thermal-dependent) |
| Prefill / TTFT | ~2.8 s for a 529-token prompt (batched MoE prefill); short prompts (<24 tok) ≈1.5 s |
| Accuracy | 100 % greedy parity vs the fp32 reference on 5 prompts + a 529-token long-context test |
| Context | 2048 tokens |
| Peak RSS | ~6 GB |

GQA-native attention plus a fused FFN lifted decode 2.5–3.3× over the first 2048 build (2.1 tok/s). A batched MoE prefill (pf_lo/pf_hi + ffn_pf: one forward per layer over the whole prompt plus an expert-grouped FFN, seeding the decode KV cache) then cut the 529-token first-token latency from **95 s → ~2.8 s (~34×)**. Short prompts (<24 tokens) skip it; prompts longer than 576 tokens fall back to decode-over-prompt.
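The prefill selection rule above can be written as a small decision function. This is a sketch, not the `phimoe_generate` source: the function name and return labels are hypothetical, the thresholds are the ones quoted in this card (24 and 576 tokens), and it assumes that short prompts which skip batched prefill go through decode-over-prompt instead.

```python
SHORT_PROMPT_TOKENS = 24   # prompts below this skip batched prefill
PREFILL_WINDOW = 576       # PN of the pf_lo / pf_hi graphs

def choose_prefill_path(n_prompt_tokens, has_prefill_graphs=True):
    """Return which prefill path a prompt of this length would take.
    ASSUMPTION: short prompts and over-long prompts both use
    decode-over-prompt, as do bundles without the prefill graphs."""
    if n_prompt_tokens < 1:
        raise ValueError("prompt must contain at least one token")
    if not has_prefill_graphs:
        return "decode_over_prompt"
    if n_prompt_tokens < SHORT_PROMPT_TOKENS:
        return "decode_over_prompt"
    if n_prompt_tokens <= PREFILL_WINDOW:
        return "batched_prefill"
    return "decode_over_prompt"
```

For example, the 529-token needle-recall prompt falls inside the 576-token window and takes the batched path, while a 10-token question does not.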

Contents (`v79/`)

| File | Description | Size |
|---|---|---|
| `phimoe.json` | the QHexRT manifest (`phimoe_generate` host-op) | small |
| `a_lo.bin` / `a_hi.bin` | GQA-native attn+router graphs, layers 0–15 / 16–31 (KV cache, MAXCTX 2048) | 650 MB each |
| `ffn2.bin` | fused 2-expert FFN graph (both top-2 experts + m1,m2 → 1 execute) | 108 KB |
| `pf_lo.bin` / `pf_hi.bin` | batched-prefill attn+router graphs, layers 0–15 / 16–31 (causal, PN=576) | 659 MB each |
| `ffn_pf.bin` | batched single-expert FFN for prefill (run once per used expert/layer) | 86 KB |
| `lmhead_ar1.bin` | final LayerNorm + lm-head → logits | 251 MB |
| `experts_i8.bin` | all 512 experts (32×16), per-output-channel int8 | 2.69 GB |
| `experts_scale.f32` | int8 dequant scales | 9.8 MB |
| `embed_f16.bin` | token embedding table (host lookup), fp16 | 251 MB |
| `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json` | Phi-3 tokenizer | small |
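The expert weights are listed as per-output-channel int8 with a separate file of float scales. A minimal sketch of what that scheme usually means is shown below. It assumes symmetric quantization (one scale per output row, no zero point), which matches a scales-only side file but is not confirmed by this card; the function names are hypothetical and the on-disk layout of `experts_i8.bin` is not modelled.

```python
def quantize_per_output_channel(weight):
    """weight: list of rows, one row per output channel.
    Returns (int8-range rows, one float scale per row).
    ASSUMPTION: symmetric quantization, scale = max|w| / 127."""
    q_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales

def dequantize(q_rows, scales):
    """Rebuild float weights: each row is multiplied by its own scale."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]
```

Because each output row carries its own scale, a row of small weights keeps its resolution even when another row in the same matrix has large values; the round-trip error per element is bounded by half of that row's scale.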

Run

# 1) download
hf download runanywhere/phi_tiny_moe_HNPU --local-dir phi_tiny_moe_HNPU
# 2) build qhx_generate from QHexRT for aarch64-android, push it + the QNN runtime libs
#    (libQnnHtp.so, libQnnSystem.so, the v79 HTP skel) to /data/local/tmp/phimoe
# 3) push this bundle
adb push phi_tiny_moe_HNPU/v79 /data/local/tmp/phimoe         # (PowerShell on Windows — native paths)
# 4) run — plain text in, text out
adb shell "cd /data/local/tmp/phimoe && export ADSP_LIBRARY_PATH='/data/local/tmp/phimoe;/vendor/dsp/cdsp'; \
  LD_LIBRARY_PATH=. ./qhx_generate phimoe.json libQnnHtp.so libQnnSystem.so . 24 'What is the capital of France?'"
# -> "The capital of France is Paris. It is not only the largest city in France ..."

(The manifest also accepts a raw comma-separated token-id list in place of the text prompt, for exact-id repro.)
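When the run step is scripted from a host machine, the nested quoting is the easy part to get wrong. The helper below is a hedged sketch that only builds the `adb shell` argument list shown above, with the prompt shell-quoted. It assumes the positional `24` is the number of tokens to generate and that `.` is the bundle directory; neither is documented in this card. The function names are hypothetical and the token ids in the usage note are arbitrary.

```python
import shlex

DEVICE_DIR = "/data/local/tmp/phimoe"

def build_run_command(prompt, max_new_tokens=24, device_dir=DEVICE_DIR):
    """Build an argv list for subprocess.run; nothing is executed here.
    ASSUMPTION: argument order copied from the example command, with
    max_new_tokens in the position of the literal 24."""
    adsp_path = device_dir + ";/vendor/dsp/cdsp"
    remote = (
        f"cd {shlex.quote(device_dir)} && "
        f"export ADSP_LIBRARY_PATH={shlex.quote(adsp_path)}; "
        "LD_LIBRARY_PATH=. ./qhx_generate phimoe.json "
        "libQnnHtp.so libQnnSystem.so . "
        f"{int(max_new_tokens)} {shlex.quote(prompt)}"
    )
    return ["adb", "shell", remote]

def token_ids_prompt(ids):
    """Format a raw token-id list as the comma-separated form."""
    return ",".join(str(int(i)) for i in ids)
```

A call such as `subprocess.run(build_run_command("What's the capital of France?"))` would then survive the apostrophe in the prompt, and `build_run_command(token_ids_prompt([1, 2, 3]))` passes ids in place of text.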

Caveats

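The v79/ bundle runs on SM8750 only; SM8850 uses the decode-only v81/ bundle described below. Any other arch needs a re-export (the build plane / npu-forge).
MAXCTX 2048 (prompt + generation ≤ 2048). Prompts longer than 576 tokens fall back to decode-over-prompt prefill, which processes the prompt one token at a time at roughly decode speed (the 529-token prompt took 95 s this way), so very long prompts take minutes to the first token; decode is ~5–7 tok/s regardless of length. A faster small-window variant (e.g. 512 + sliding KV ring) is a re-export (MAXCTX=512 build_alo_ahi_2048_v2.sh).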
Greedy/argmax decode (temperature 0). Weight-only int8 occasionally flips a thin-margin token; output stays coherent (the prior-port finding — int8 is the accuracy/memory sweet spot, W4 too coarse).
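The context limit above translates into a simple budget check before a run. The sketch below is illustrative only: the function names are hypothetical, the limit and the 5–7 tok/s range are the figures quoted in this card, and the time estimate ignores prefill and thermal throttling.

```python
MAXCTX = 2048  # prompt + generation must fit in this many tokens

def max_new_tokens_allowed(n_prompt_tokens, requested):
    """Clamp a requested generation length to the remaining context."""
    if n_prompt_tokens < 1 or requested < 0:
        raise ValueError("invalid token counts")
    room = MAXCTX - n_prompt_tokens
    if room <= 0:
        raise ValueError("prompt alone fills or exceeds MAXCTX")
    return min(requested, room)

def estimate_decode_seconds(n_new_tokens, fast_tok_s=7.0, slow_tok_s=5.0):
    """Return (best, worst) decode time for the quoted tok/s range."""
    return n_new_tokens / fast_tok_s, n_new_tokens / slow_tok_s
```

For instance, a 2040-token prompt leaves room for only 8 generated tokens, whatever length was requested.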

Built with the npu-forge toolkit (weights → oracle-gated NPU graphs). Base model + tokenizer © Microsoft (MIT).

v81 (SM8850 / soc_model 87) — DECODE-ONLY

Device-validated on SM8850: "What is the capital of France?" -> "The capital of France is Paris. It is not only the country's largest city but also a global center for art" — coherent, greedy first token 450 (= the PyTorch gold), ~5.7 tok/s decode (matches v79), TTFT 1.7 s for a 10-token prompt (decode-over-prompt).

Why decode-only: the full 7-context bundle (64 graphs incl. batched prefill) crashes the v81 cDSP — the unsigned protection-domain heap is exhausted during Graph::setup_vtcm in libQnnHtpV81Skel.so (fastrpc 0x8000040d AEE_ENOMEMORY -> remoteproc-cdsp fatal, recovery disabled -> full device reboot). NOT host RAM (12.7 GB free). The v81/ bundle therefore ships the 4 decode contexts only (a_lo/a_hi/ffn2/lmhead_ar1); phimoe_generate auto-falls-back to decode-over-prompt (the batched-prefill graphs are optional). Trade-off: slower TTFT on long prompts; identical decode quality/speed. (v79 keeps the full batched-prefill bundle.)

Files (`v81/`)

phimoe.json (decode-only manifest) · a_lo.bin a_hi.bin (decode attn+router a0..a31) · ffn2.bin (fused 2-expert FFN) · lmhead_ar1.bin (final-norm + lm-head, input h) · experts_i8.bin (int8 experts, host) · experts_scale.f32 · embed_f16.bin · tokenizer.json (+ config/special-tokens).
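Since the two bundles differ only in whether the batched-prefill graphs are present, a downloaded directory can be sanity-checked by file name. This is a sketch with a hypothetical function name; the file sets are taken from the listings in this card, and only `tokenizer.json` is required among the tokenizer files because the v81 listing does not spell out the others.

```python
DECODE_FILES = {
    "phimoe.json", "a_lo.bin", "a_hi.bin", "ffn2.bin", "lmhead_ar1.bin",
    "experts_i8.bin", "experts_scale.f32", "embed_f16.bin", "tokenizer.json",
}
PREFILL_FILES = {"pf_lo.bin", "pf_hi.bin", "ffn_pf.bin"}

def classify_bundle(filenames):
    """Return (kind, missing) where kind is 'full', 'decode_only' or
    'incomplete', and missing lists the absent files for that verdict."""
    present = set(filenames)
    missing_decode = DECODE_FILES - present
    if missing_decode:
        return "incomplete", sorted(missing_decode)
    missing_prefill = PREFILL_FILES - present
    if not missing_prefill:
        return "full", []
    return "decode_only", sorted(missing_prefill)
```

Passing `os.listdir("phi_tiny_moe_HNPU/v81")` after the download step should report `decode_only`, and the `v79` directory should report `full`.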
