Phi-tiny-MoE: Hexagon v79 + v81 NPU bundle (QHexRT)
microsoft/Phi-tiny-MoE-instruct (PhiMoEForCausalLM, 3.8 B total / 1.1 B active, 16 experts top-2) compiled to run on the Qualcomm Hexagon v79 NPU (Snapdragon 8 Elite / SM8750, e.g. Galaxy S25) through QHexRT, a thin C++ QNN runtime. First MoE in the QHexRT family.
A native, manifest-driven QHexRT bundle: text in -> text out via the standard
qhx_generate tool, exactly like the other models. Every number below is a real on-device measurement on a Samsung S25.
How it runs
A phimoe_generate host-op drives an AR=1 KV-cache decode. Per layer it runs a GQA-native attention+router NPU
graph (K/V kept at 4 heads, so no repeat_interleave/VTCM spill) -> host sparsemixer top-2 -> ONE fused
2-expert FFN graph (ffn2: both selected experts dequantized from host-RAM int8 in parallel + m1,m2 ->
m1·SwiGLU_a + m2·SwiGLU_b in a single execute). It reads only the 2 active experts per token.

Quant: int8 experts + fp16 attention / router / lm-head (int16 activations break attention; W4 is too coarse).
It fits the device in 4 QNN contexts (the HTP caps concurrent contexts at approximately 8).

The Phi-3 tokenizer + chat template (<|user|> ... <|end|><|assistant|>) are applied on-device, so you pass plain text.
MAXCTX = 2048 (16 Q-heads: wide attention is HTP-correct past the v79 512 wall; verified 12/12 vs fp32 on a
529-token needle-recall prompt).
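The fused ffn2 step can be written out as follows. This is only a sketch: it assumes the usual gate/up/down SwiGLU layout for each expert, and the weight names and the exact placement of the dequant scales are illustrative rather than taken from the bundle.

$$
\begin{aligned}
W^{(e)} &= \operatorname{diag}\!\left(s^{(e)}\right) Q^{(e)} && \text{(per-output-channel int8 dequant)} \\
\mathrm{SwiGLU}_e(x) &= W^{(e)}_{\text{down}} \left( \mathrm{SiLU}\!\left(W^{(e)}_{\text{gate}}\, x\right) \odot W^{(e)}_{\text{up}}\, x \right) \\
y &= m_1\, \mathrm{SwiGLU}_a(x) + m_2\, \mathrm{SwiGLU}_b(x)
\end{aligned}
$$

Here $a$ and $b$ are the two experts picked by the router for the current token, $m_1, m_2$ are their mixing weights from the host sparsemixer, $Q^{(e)}$ is an int8 weight matrix and $s^{(e)}$ its vector of per-row (output-channel) scales.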
| Metric | Value |
|---|---|
| Decode | ~5–7 tok/s (top-2; best ~7 cold, thermal-dependent) |
| Prefill / TTFT | ~2.8 s for a 529-token prompt (batched MoE prefill); short prompts (<24 tok) ≈1.5 s |
| Accuracy | 100 % greedy parity vs the fp32 reference: 5 prompts + a 529-token long-context test |
| Context | 2048 tokens |
| Peak RSS | ~6 GB |
GQA-native attention + a fused FFN lifted decode
2.5–3.3× over the first 2048 build (2.1 tok/s). A batched MoE prefill (pf_lo/pf_hi + ffn_pf: one forward/layer over the whole prompt + an expert-grouped FFN, seeding the decode KV cache) then cut the 529-token first-token latency from 95 s to **2.8 s (34×)**. Short prompts skip it; prompts > 576 fall back to decode-over-prompt.
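The prompt-length rules above can be summarised in a few lines of shell. This is a minimal sketch, not the runtime's own logic: the function name `prefill_path` is hypothetical, and it assumes that the "short prompt" cutoff is the <24-token figure from the table and that the batched-prefill window is PN=576.

```shell
#!/bin/sh
# Hypothetical helper: which prefill path a prompt of N tokens would take.
# Assumptions: short prompts are < 24 tokens; batched prefill covers up to 576.
prefill_path() {
  if [ "$1" -lt 24 ]; then
    echo "short"      # skips batched prefill
  elif [ "$1" -le 576 ]; then
    echo "batched"    # pf_lo/pf_hi + ffn_pf
  else
    echo "fallback"   # decode-over-prompt
  fi
}

prefill_path 529
```

For the 529-token needle-recall prompt this selects the batched path, which is the case the 2.8 s figure refers to.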
Contents (v79/)
| file | what | size |
|---|---|---|
| phimoe.json | the QHexRT manifest (phimoe_generate host-op) | small |
| a_lo.bin / a_hi.bin | GQA-native attn+router graphs, layers 0–15 / 16–31 (KV cache, MAXCTX 2048) | 650 MB each |
| ffn2.bin | fused 2-expert FFN graph (both top-2 experts + m1,m2 -> 1 execute) | 108 KB |
| pf_lo.bin / pf_hi.bin | batched-prefill attn+router graphs, layers 0–15 / 16–31 (causal, PN=576) | 659 MB each |
| ffn_pf.bin | batched single-expert FFN for prefill (run once per used expert/layer) | 86 KB |
| lmhead_ar1.bin | final LayerNorm + lm-head -> logits | 251 MB |
| experts_i8.bin | all 512 experts (32×16), per-output-channel int8 | 2.69 GB |
| experts_scale.f32 | int8 dequant scales | 9.8 MB |
| embed_f16.bin | token embedding table (host lookup), fp16 | 251 MB |
| tokenizer.json, tokenizer_config.json, special_tokens_map.json | Phi-3 tokenizer | small |
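As a rough planning aid, the sizes in the table can be summed to estimate how much free space the v79/ bundle needs on the device. The sketch below is plain arithmetic on the listed figures; it assumes 1 GB = 1000 MB and ignores the files marked "small".

```shell
#!/bin/sh
# Sum the v79/ file sizes from the table, in MB.
# Assumptions: 1 GB = 1000 MB; the "small" files are ignored.
bundle_size_mb() {
  awk 'BEGIN {
    total  = 2 * 650     # a_lo.bin + a_hi.bin
    total += 2 * 659     # pf_lo.bin + pf_hi.bin
    total += 0.108       # ffn2.bin (108 KB)
    total += 0.086       # ffn_pf.bin (86 KB)
    total += 251         # lmhead_ar1.bin
    total += 2690        # experts_i8.bin (2.69 GB)
    total += 9.8         # experts_scale.f32
    total += 251         # embed_f16.bin
    printf "%.0f\n", total
  }'
}

bundle_size_mb   # prints 5820
```

That is roughly 5.8 GB, which can be compared against the free space reported for /data/local/tmp before pushing.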
Run
# 1) download
hf download runanywhere/phi_tiny_moe_HNPU --local-dir phi_tiny_moe_HNPU
# 2) build qhx_generate from QHexRT for aarch64-android, push it + the QNN runtime libs
# (libQnnHtp.so, libQnnSystem.so, the v79 HTP skel) to /data/local/tmp/phimoe
# 3) push this bundle
adb push phi_tiny_moe_HNPU/v79 /data/local/tmp/phimoe # (PowerShell on Windows: native paths)
# 4) run: plain text in, text out
adb shell "cd /data/local/tmp/phimoe && export ADSP_LIBRARY_PATH='/data/local/tmp/phimoe;/vendor/dsp/cdsp'; \
LD_LIBRARY_PATH=. ./qhx_generate phimoe.json libQnnHtp.so libQnnSystem.so . 24 'What is the capital of France?'"
# -> "The capital of France is Paris. It is not only the largest city in France ..."
(The manifest also accepts a raw comma-separated token-id list in place of the text prompt, for exact-id repro.)
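Because the bundle ships separate v79/ and v81/ directories, it can help to pick the directory from the device's SoC before pushing. The helper below is a hypothetical sketch: the function name is made up, and it assumes the device reports SM8750 or SM8850 through the Android `ro.soc.model` property.

```shell
#!/bin/sh
# Hypothetical helper: map a SoC model string to the bundle directory.
# Assumption: the device reports SM8750 / SM8850 via `getprop ro.soc.model`.
bundle_dir_for_soc() {
  case "$1" in
    SM8750) echo "v79" ;;
    SM8850) echo "v81" ;;
    *) echo "unsupported SoC: $1" >&2; return 1 ;;
  esac
}

# Example with a fixed string; on a real device one might instead use:
#   soc=$(adb shell getprop ro.soc.model | tr -d '\r')
bundle_dir_for_soc SM8750
```

The result can then replace the hard-coded `v79` in the `adb push` step.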
Caveats
- v79 only (SM8750). Another arch = re-export (the build plane / npu-forge).
- MAXCTX 2048 (prompt + generation ≤ 2048). The decode-over-prompt prefill is ~0.46 s/prompt-token, so very long prompts take minutes to the first token; decode is ~2.1 tok/s regardless of length. A faster small-window variant (e.g. 512 + sliding KV ring) is a re-export (MAXCTX=512 build_alo_ahi_2048_v2.sh).
- Greedy/argmax decode (temperature 0). Weight-only int8 occasionally flips a thin-margin token; output stays coherent (the prior-port finding: int8 is the accuracy/memory sweet spot, W4 too coarse).
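The context and latency caveats reduce to two small checks. The sketch below uses hypothetical function names and takes the ~0.46 s/prompt-token figure at face value, so the estimate applies only to the decode-over-prompt path.

```shell
#!/bin/sh
MAXCTX=2048

# Succeeds when prompt tokens + tokens to generate fit in the context window.
fits_context() {
  [ $(( $1 + $2 )) -le "$MAXCTX" ]
}

# Estimated seconds to first token for decode-over-prompt prefill,
# using the ~0.46 s/prompt-token figure quoted above.
ttft_estimate_s() {
  awk -v n="$1" 'BEGIN { printf "%.0f\n", n * 0.46 }'
}

fits_context 1500 548 && echo "fits"
ttft_estimate_s 1000
```

A 1000-token prompt on that path comes out at about 460 s, which is the "minutes to the first token" case.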
Built with the npu-forge toolkit (weights -> oracle-gated NPU graphs). Base model + tokenizer © Microsoft (MIT).
v81 (SM8850 / soc_model 87): DECODE-ONLY
Device-validated on SM8850: "What is the capital of France?" -> "The capital of France is Paris. It is not only the country's largest city but also a global center for art". The output is coherent, with greedy first token 450 (= the PyTorch gold), ~5.7 tok/s decode (matches v79), and TTFT 1.7 s for a 10-token prompt (decode-over-prompt).
Why decode-only: the full 7-context bundle (64 graphs incl. batched prefill) crashes the v81 cDSP because the
unsigned protection-domain heap is exhausted during Graph::setup_vtcm in libQnnHtpV81Skel.so (fastrpc
0x8000040d AEE_ENOMEMORY -> remoteproc-cdsp fatal, recovery disabled -> full device reboot). NOT host RAM
(12.7 GB free). The v81/ bundle therefore ships the 4 decode contexts only (a_lo/a_hi/ffn2/lmhead_ar1);
phimoe_generate auto-falls-back to decode-over-prompt (the batched-prefill graphs are optional). Trade-off:
slower TTFT on long prompts; identical decode quality/speed. (v79 keeps the full batched-prefill bundle.)
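Before pushing a bundle, a quick host-side check shows whether it carries the optional batched-prefill graphs. This is a hypothetical sanity check based only on the file names listed in this card; it does not describe how phimoe_generate itself decides to fall back.

```shell
#!/bin/sh
# Hypothetical host-side check: does a bundle directory contain the
# optional batched-prefill graphs (pf_lo.bin, pf_hi.bin, ffn_pf.bin)?
has_batched_prefill() {
  for f in pf_lo.bin pf_hi.bin ffn_pf.bin; do
    [ -f "$1/$f" ] || return 1
  done
}

# Demo on a throwaway directory that mimics the decode-only layout.
demo=$(mktemp -d)
touch "$demo/a_lo.bin" "$demo/a_hi.bin" "$demo/ffn2.bin" "$demo/lmhead_ar1.bin"
if has_batched_prefill "$demo"; then
  echo "batched prefill available"
else
  echo "decode-over-prompt only"
fi
rm -rf "$demo"
```

Run against the downloaded directories, one would expect v79/ to report the batched path and v81/ to report decode-over-prompt only.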
Files (v81/)
phimoe.json (decode-only manifest) · a_lo.bin a_hi.bin (decode attn+router a0..a31) · ffn2.bin (fused
2-expert FFN) · lmhead_ar1.bin (final-norm + lm-head, input h) · experts_i8.bin (int8 experts, host) ·
experts_scale.f32 · embed_f16.bin · tokenizer.json (+ config/special-tokens).