Phi-tiny-MoE: Hexagon v79 + v81 NPU bundle (QHexRT)

microsoft/Phi-tiny-MoE-instruct (PhiMoEForCausalLM, 3.8 B total / 1.1 B active, 16 experts top-2) compiled to run on the Qualcomm Hexagon v79 NPU (Snapdragon 8 Elite / SM8750, e.g. Galaxy S25) through QHexRT, a thin C++ QNN runtime. First MoE in the QHexRT family.

A native, manifest-driven QHexRT bundle: text in → text out via the standard qhx_generate tool, exactly like the other models. Every number below is a real on-device measurement on a Samsung S25.

How it runs

A phimoe_generate host-op drives an AR=1 KV-cache decode. Per layer, a GQA-native attention+router NPU graph (K/V kept at 4 heads, so no repeat_interleave and no VTCM spill) feeds a host-side sparsemixer top-2 selection, followed by ONE fused 2-expert FFN graph (ffn2): both selected experts are dequantized from host-RAM int8 in parallel and passed in together with m1, m2, and the graph computes m1·SwiGLU_a + m2·SwiGLU_b in a single execute. Only the 2 active experts are read per token. Quantization: int8 experts + fp16 attention / router / lm-head (int16 activations break attention; W4 is too coarse). Decode fits the device in 4 QNN contexts (the HTP caps concurrent contexts at ≈ 8). The Phi-3 tokenizer + chat template (<|user|> … <|end|><|assistant|>) are applied on-device, so you pass plain text. MAXCTX = 2048 (with 16 Q-heads, the wide attention stays HTP-correct past the v79 512 wall; verified 12/12 vs fp32 on a 529-token needle-recall prompt).
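The fused 2-expert step can be illustrated on the host with a minimal pure-Python sketch. It is not the ffn2 graph itself: the dictionary layout (`gate`, `up`, `down` plus `*_scale` keys), the helper names, and the toy sizes are hypothetical, and m1, m2 are simply taken as given rather than produced by sparsemixer. It assumes the usual SwiGLU form down(silu(gate·x) * (up·x)) and per-output-channel int8 scales (one scale per matrix row).

```python
import math

def dequant_rows(q_rows, scales):
    # Per-output-channel dequant: every int8 value in row i is multiplied by scales[i].
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]

def dequant_expert(q):
    # q is a hypothetical dict holding int8 matrices and their per-row scales.
    return tuple(dequant_rows(q[n], q[n + "_scale"]) for n in ("gate", "up", "down"))

def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def silu(v):
    return v / (1.0 + math.exp(-v))

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU FFN: down( silu(gate @ x) * (up @ x) )
    g = matvec(w_gate, x)
    u = matvec(w_up, x)
    h = [silu(a) * b for a, b in zip(g, u)]
    return matvec(w_down, h)

def fused_top2(x, q_a, q_b, m1, m2):
    # m1 * SwiGLU_a(x) + m2 * SwiGLU_b(x), using only the two selected experts.
    ya = swiglu_ffn(x, *dequant_expert(q_a))
    yb = swiglu_ffn(x, *dequant_expert(q_b))
    return [m1 * a + m2 * b for a, b in zip(ya, yb)]

# Toy experts (hidden size 2): expert A dequantizes to identity matrices, expert B to zeros.
ident = [[2, 0], [0, 2]]
zeros = [[0, 0], [0, 0]]
half = [0.5, 0.5]
expert_a = {"gate": ident, "gate_scale": half, "up": ident, "up_scale": half,
            "down": ident, "down_scale": half}
expert_b = {"gate": zeros, "gate_scale": half, "up": zeros, "up_scale": half,
            "down": zeros, "down_scale": half}
out = fused_top2([1.0, 2.0], expert_a, expert_b, m1=0.75, m2=0.25)
```

Because the other 14 experts of the layer are never touched, the per-token cost scales with the 2 selected experts rather than with all 16.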

| Metric | Value |
|---|---|
| Decode | ~5–7 tok/s (top-2; best ~7 cold, thermal-dependent) |
| Prefill / TTFT | ~2.8 s for a 529-token prompt (batched MoE prefill); short prompts (<24 tok) ≈ 1.5 s |
| Accuracy | 100 % greedy parity vs the fp32 reference: 5 prompts + a 529-token long-context test |
| Context | 2048 tokens |
| Peak RSS | ~6 GB |

GQA-native attention + a fused FFN lifted decode 2.5–3.3× over the first 2048 build (2.1 tok/s). A batched MoE prefill (pf_lo/pf_hi + ffn_pf: one forward/layer over the whole prompt + an expert-grouped FFN, seeding the decode KV cache) then cut the 529-token first-token latency from **95 s → 2.8 s (34×)**. Short prompts skip it; prompts > 576 tokens fall back to decode-over-prompt.
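The expert-grouped FFN idea can be sketched as follows: instead of running two experts per token, tokens are bucketed by the expert they selected, so each used expert runs once per layer over its whole bucket. This is a host-side illustration under stated assumptions, not the ffn_pf graph: `run_expert` is a hypothetical callback standing in for one batched expert execution, and the routing pairs and weights are taken as given.

```python
from collections import defaultdict

def group_tokens_by_expert(top2_per_token):
    # Map expert id -> list of (token_index, slot), where slot 0/1 is the
    # position of that expert in the token's top-2 pair.
    groups = defaultdict(list)
    for t, pair in enumerate(top2_per_token):
        for slot, expert in enumerate(pair):
            groups[expert].append((t, slot))
    return dict(groups)

def grouped_ffn(hidden, top2_per_token, weights_per_token, run_expert):
    # Run each used expert once over all tokens routed to it, then
    # scatter-add the weighted outputs back to the per-token rows.
    out = [[0.0] * len(h) for h in hidden]
    calls = 0
    for expert, members in sorted(group_tokens_by_expert(top2_per_token).items()):
        batch = [hidden[t] for t, _ in members]
        ys = run_expert(expert, batch)  # one batched call per used expert
        calls += 1
        for (t, slot), y in zip(members, ys):
            w = weights_per_token[t][slot]
            out[t] = [o + w * v for o, v in zip(out[t], y)]
    return out, calls

# Toy run: 3 tokens, and a fake "expert e" that just multiplies its input by e.
fake_expert = lambda e, batch: [[e * v for v in row] for row in batch]
hidden = [[1.0], [2.0], [3.0]]
top2 = [(0, 1), (1, 2), (0, 1)]
weights = [(0.5, 0.5), (1.0, 0.0), (0.25, 0.75)]
mixed, n_calls = grouped_ffn(hidden, top2, weights, fake_expert)
```

In this toy run only 3 expert calls are made, where a token-by-token loop would make 6 (2 per token); the saving grows with prompt length because the number of distinct experts per layer is capped at 16.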

Contents (v79/)

| File | What | Size |
|---|---|---|
| phimoe.json | the QHexRT manifest (phimoe_generate host-op) | small |
| a_lo.bin / a_hi.bin | GQA-native attn+router graphs, layers 0–15 / 16–31 (KV cache, MAXCTX 2048) | 650 MB each |
| ffn2.bin | fused 2-expert FFN graph (both top-2 experts + m1,m2 → 1 execute) | 108 KB |
| pf_lo.bin / pf_hi.bin | batched-prefill attn+router graphs, layers 0–15 / 16–31 (causal, PN=576) | 659 MB each |
| ffn_pf.bin | batched single-expert FFN for prefill (run once per used expert/layer) | 86 KB |
| lmhead_ar1.bin | final LayerNorm + lm-head → logits | 251 MB |
| experts_i8.bin | all 512 experts (32×16), per-output-channel int8 | 2.69 GB |
| experts_scale.f32 | int8 dequant scales | 9.8 MB |
| embed_f16.bin | token embedding table (host lookup), fp16 | 251 MB |
| tokenizer.json, tokenizer_config.json, special_tokens_map.json | Phi-3 tokenizer | small |

Run

# 1) download
hf download runanywhere/phi_tiny_moe_HNPU --local-dir phi_tiny_moe_HNPU
# 2) build qhx_generate from QHexRT for aarch64-android, push it + the QNN runtime libs
#    (libQnnHtp.so, libQnnSystem.so, the v79 HTP skel) to /data/local/tmp/phimoe
# 3) push this bundle
adb push phi_tiny_moe_HNPU/v79 /data/local/tmp/phimoe         # (PowerShell on Windows: native paths)
# 4) run: plain text in, text out
adb shell "cd /data/local/tmp/phimoe && export ADSP_LIBRARY_PATH='/data/local/tmp/phimoe;/vendor/dsp/cdsp'; \
  LD_LIBRARY_PATH=. ./qhx_generate phimoe.json libQnnHtp.so libQnnSystem.so . 24 'What is the capital of France?'"
# -> "The capital of France is Paris. It is not only the largest city in France ..."

(The manifest also accepts a raw comma-separated token-id list in place of the text prompt, for exact-id repro.)
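When scripting many prompts, quoting the device-side command by hand gets error-prone (apostrophes in the prompt, the `;` inside ADSP_LIBRARY_PATH). The sketch below only assembles the same command line shown in step 4; the function name and parameters are hypothetical, and treating the `24` argument as a token count is an assumption, since the text above does not say what it means.

```python
import shlex

def build_qhx_command(prompt, n_tokens=24, workdir="/data/local/tmp/phimoe"):
    # Returns an argv list for adb; the device-side command is one shell string.
    # `prompt` may be plain text or a comma-separated token-id list.
    # `n_tokens` mirrors the literal 24 in the example (meaning assumed).
    adsp = f"{workdir};/vendor/dsp/cdsp"
    inner = (
        f"cd {shlex.quote(workdir)} && "
        f"export ADSP_LIBRARY_PATH={shlex.quote(adsp)}; "
        "LD_LIBRARY_PATH=. ./qhx_generate phimoe.json libQnnHtp.so libQnnSystem.so . "
        f"{int(n_tokens)} {shlex.quote(prompt)}"
    )
    return ["adb", "shell", inner]

cmd = build_qhx_command("What is the capital of France?")
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` on a host with adb on its PATH and the bundle already pushed.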

Caveats

  • The v79/ bundle is v79 only (SM8750); SM8850 uses the decode-only v81/ bundle described below. Any other arch needs a re-export (the build plane / npu-forge).
  • MAXCTX 2048 (prompt + generation ≤ 2048). Prompts longer than 576 tokens fall back to decode-over-prompt prefill, which runs at about decode speed (roughly 0.18 s per prompt token, from the 95 s measured for 529 tokens), so very long prompts take minutes to the first token; decode itself is ~5–7 tok/s. A faster small-window variant (e.g. 512 + sliding KV ring) is a re-export (MAXCTX=512 build_alo_ahi_2048_v2.sh).
  • Greedy/argmax decode (temperature 0). Weight-only int8 occasionally flips a thin-margin token; output stays coherent (the prior-port finding: int8 is the accuracy/memory sweet spot, W4 too coarse).
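The last caveat (a thin-margin token flipping under int8) follows from how greedy decoding works: argmax only changes when the quantization error is comparable to the gap between the top two logits. A minimal sketch with made-up logits and a made-up error bound, neither of which is measured from this model:

```python
def greedy_token(logits):
    # Greedy/argmax decode: pick the index of the largest logit.
    return max(range(len(logits)), key=lambda i: logits[i])

def top2_margin(logits):
    # Gap between the best and second-best logit.
    top = sorted(logits, reverse=True)
    return top[0] - top[1]

def can_flip(logits, max_abs_error):
    # Worst case: the winner drops by the bound while the runner-up rises by it,
    # so a flip is possible only if the margin is at most twice the bound.
    return top2_margin(logits) <= 2 * max_abs_error

confident = [1.0, 4.0, 0.5]      # hypothetical logits, wide margin
thin = [2.00, 2.04, 0.1]         # hypothetical logits, thin margin
perturbed = [2.05, 1.99, 0.1]    # `thin` with each logit moved by at most 0.05
```

Here `can_flip(confident, 0.05)` is false while `can_flip(thin, 0.05)` is true, and `perturbed` indeed selects a different token than `thin`. Tokens with a wide margin are unaffected, which is consistent with the output staying coherent.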

Built with the npu-forge toolkit (weights → oracle-gated NPU graphs). Base model + tokenizer © Microsoft (MIT).

v81 (SM8850 / soc_model 87): DECODE-ONLY

Device-validated on SM8850: "What is the capital of France?" -> "The capital of France is Paris. It is not only the country's largest city but also a global center for art". The output is coherent, the greedy first token is 450 (= the PyTorch gold), decode runs at ~5.7 tok/s (matches v79), and TTFT is 1.7 s for a 10-token prompt (decode-over-prompt).

Why decode-only: the full 7-context bundle (64 graphs incl. batched prefill) crashes the v81 cDSP. The unsigned protection-domain heap is exhausted during Graph::setup_vtcm in libQnnHtpV81Skel.so (fastrpc 0x8000040d AEE_ENOMEMORY -> remoteproc-cdsp fatal, recovery disabled -> full device reboot). This is NOT a host-RAM problem (12.7 GB free). The v81/ bundle therefore ships the 4 decode contexts only (a_lo/a_hi/ffn2/lmhead_ar1); phimoe_generate auto-falls-back to decode-over-prompt (the batched-prefill graphs are optional). Trade-off: slower TTFT on long prompts; identical decode quality/speed. (v79 keeps the full batched-prefill bundle.)
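The fallback behaviour can be summarized as a small decision function. This is a sketch of the rules as described in this card, not the phimoe_generate source: the function name is hypothetical, the 576 limit is the PN value of the prefill graphs, and using 24 tokens as the "short prompt" cutoff is an assumption taken from the metrics table.

```python
PREFILL_GRAPHS = ("pf_lo.bin", "pf_hi.bin", "ffn_pf.bin")

def select_prefill_mode(bundle_files, prompt_tokens, pn=576, short_prompt=24):
    # Decode-only bundles (e.g. v81/) ship no batched-prefill graphs.
    if not all(name in bundle_files for name in PREFILL_GRAPHS):
        return "decode-over-prompt"
    # Short prompts skip batched prefill (cutoff value is an assumption).
    if prompt_tokens < short_prompt:
        return "decode-over-prompt"
    # Prompts longer than the prefill graph width fall back as well.
    if prompt_tokens > pn:
        return "decode-over-prompt"
    return "batched-prefill"

v79_files = {"phimoe.json", "a_lo.bin", "a_hi.bin", "ffn2.bin", "lmhead_ar1.bin",
             "pf_lo.bin", "pf_hi.bin", "ffn_pf.bin"}
v81_files = {"phimoe.json", "a_lo.bin", "a_hi.bin", "ffn2.bin", "lmhead_ar1.bin"}
```

With these inputs a 529-token prompt selects batched prefill on the v79 file set and decode-over-prompt on the v81 file set, which is where the slower long-prompt TTFT on v81 comes from.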

Files (v81/)

phimoe.json (decode-only manifest) · a_lo.bin a_hi.bin (decode attn+router a0..a31) · ffn2.bin (fused 2-expert FFN) · lmhead_ar1.bin (final-norm + lm-head, input h) · experts_i8.bin (int8 experts, host) · experts_scale.f32 · embed_f16.bin · tokenizer.json (+ config/special-tokens).
