FSMN-VAD — CoreML (Apple Neural Engine)

CoreML conversion of FunASR's FSMN-VAD (~5.2M params), for on-device voice activity detection on Apple Silicon. Upstream: iic/speech_fsmn_vad_zh-cn-16k-common-pytorch.

Files

File	Precision	Compute unit	Role
`FsmnVadPreprocessor.mlmodelc`	FP32	CPU	waveform → 400-d features (fbank80 + LFR m=5,n=1 + CMVN)
`FsmnVad.mlmodelc`	FP16	ANE	FSMN scorer → per-frame scores `[1, T, 248]`
`vad_config.json`	—	—	decision params (`sil_pdf_ids`, thresholds)

Pipeline

waveform → [Preprocessor fp32/CPU] → features [1,T,400]
        → [FSMN fp16/ANE] → scores [1,T,248]
        → host: silence_prob[t] = softmax(scores[0, t])[sil_pdf_ids].sum()  (sil_pdf_ids=[0])
        → state machine (thresholds in vad_config) → speech segments [start_ms, end_ms]
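
The host-side silence probability step can be sketched in NumPy as follows. This is a minimal illustration, assuming the scorer output is available as a float array of shape `[1, T, 248]`; the function name `silence_prob` is hypothetical and not part of this repository.

```python
import numpy as np

def silence_prob(scores: np.ndarray, sil_pdf_ids=(0,)) -> np.ndarray:
    """Per-frame silence probability from scorer output of shape [1, T, C]."""
    # Numerically stable softmax over the class axis.
    z = scores - scores.max(axis=-1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    # Sum the probability mass of the silence classes; result has shape [T].
    return p[0][:, list(sil_pdf_ids)].sum(axis=-1)
```

With `sil_pdf_ids=[0]` the sum reduces to the softmax probability of class 0 for each frame.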

Frame rate: 10 ms (LFR n=1, no downsampling).
The segment decision logic (FunASR's FsmnVADStreaming) runs on the host. It applies silence/speech hysteresis governed by max_end_silence_time (800 ms), max_start_silence_time (3000 ms), max_single_segment_time (60 s), and sil_to_speech_time_thres (150 ms). These values are stored in vad_config.json.
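
To give a feel for how these parameters interact, here is a deliberately simplified hysteresis sketch. It is not FunASR's FsmnVADStreaming logic: the function name and argument names are hypothetical, the input is assumed to be a per-frame speech/non-speech decision already derived from the silence probability, and max_start_silence_time is left out.

```python
def segments_from_frames(is_speech, frame_ms=10, sil_to_speech_ms=150,
                         max_end_silence_ms=800, max_segment_ms=60_000):
    """Simplified hysteresis: returns [[start_ms, end_ms], ...]."""
    segments = []
    start = None      # frame index where the open segment began
    run_speech = 0    # consecutive speech frames while outside a segment
    run_sil = 0       # consecutive silence frames while inside a segment
    for i, speech in enumerate(is_speech):
        if start is None:
            run_speech = run_speech + 1 if speech else 0
            # Open a segment once speech has persisted long enough.
            if run_speech * frame_ms >= sil_to_speech_ms:
                start = i - run_speech + 1
                run_sil = 0
        else:
            run_sil = 0 if speech else run_sil + 1
            too_long = (i + 1 - start) * frame_ms >= max_segment_ms
            if run_sil * frame_ms >= max_end_silence_ms or too_long:
                # Close at the last speech frame, dropping trailing silence.
                end = i + 1 - run_sil
                segments.append([start * frame_ms, end * frame_ms])
                start, run_speech = None, 0
    if start is not None:
        # Flush a segment that is still open at the end of the input.
        end = len(is_speech) - run_sil
        segments.append([start * frame_ms, end * frame_ms])
    return segments
```

For example, 200 ms of silence followed by 1 s of speech and 1 s of silence yields a single segment starting at 200 ms and ending at 1200 ms: the segment opens after 150 ms of sustained speech and is backdated to the first speech frame, then closes once 800 ms of trailing silence has accumulated.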

Benchmark — fidelity vs FunASR (FLEURS zh, n=50)

Metric	Value
Frame F1	97.4% (P 100.0% / R 94.8%)
Median RTFx	1209x

Parity with FunASR: the preprocessor output matches WavFrontendOnline to max|Δ| ≈ 3e-5, and the FSMN scorer output to max|Δ| = 0.0016. Segment boundaries match FunASR within ~50 ms.
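
For reference, frame-level precision, recall, and F1 between two sets of segments could be computed along the lines below. This is a sketch under assumptions, not the evaluation script used for the table: it assumes both the FunASR reference and the CoreML output are given as `[start_ms, end_ms]` segments and are rasterized onto the 10 ms frame grid. The name `frame_prf` is hypothetical.

```python
def frame_prf(ref_segments, hyp_segments, total_ms, frame_ms=10):
    """Frame-level (precision, recall, f1) of hyp against ref segments."""
    n = total_ms // frame_ms

    def rasterize(segments):
        # Mark every frame covered by a segment as speech.
        mask = [False] * n
        for start_ms, end_ms in segments:
            for i in range(start_ms // frame_ms, min(n, end_ms // frame_ms)):
                mask[i] = True
        return mask

    ref, hyp = rasterize(ref_segments), rasterize(hyp_segments)
    tp = sum(r and h for r, h in zip(ref, hyp))
    fp = sum(h and not r for r, h in zip(ref, hyp))
    fn = sum(r and not h for r, h in zip(ref, hyp))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

A precision of 100% with recall below 100% means that, under this definition, every frame the converted model marks as speech is also speech in the reference, while some reference speech frames are missed.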

License

Weights derive from FunASR's FSMN-VAD; upstream license applies. Format conversion only.
