FSMN-VAD: CoreML (Apple Neural Engine)

CoreML conversion of FunASR's FSMN-VAD (~5.2M params), for on-device voice activity detection on Apple Silicon. Upstream: iic/speech_fsmn_vad_zh-cn-16k-common-pytorch.

Files

| File | Precision | Compute unit | Role |
|---|---|---|---|
| FsmnVadPreprocessor.mlmodelc | FP32 | CPU | waveform → 400-d features (fbank80 + LFR m=5,n=1 + CMVN) |
| FsmnVad.mlmodelc | FP16 | ANE | FSMN scorer → per-frame scores [1, T, 248] |
| vad_config.json | - | - | decision params (sil_pdf_ids, thresholds) |
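
The 400-d feature size follows from stacking m=5 consecutive 80-d fbank frames (5 × 80 = 400) with a shift of n=1. A minimal NumPy sketch of that stacking, assuming edge-replication padding in the style of FunASR's front end. The name `stack_lfr` is hypothetical, the padding inside the converted preprocessor may differ, and CMVN is omitted.

```python
import numpy as np

def stack_lfr(fbank, m=5, n=1):
    """Stack m consecutive frames with shift n (low frame rate features).

    fbank: [T, D] array. Returns [ceil(T / n), m * D].
    Assumption: (m - 1) // 2 copies of the first frame are prepended and the
    last frame is repeated as needed so every window is full.
    """
    T, _ = fbank.shape
    left = (m - 1) // 2
    t_out = int(np.ceil(T / n))
    needed = (t_out - 1) * n + m              # frames required by the last window
    right = max(0, needed - (T + left))
    padded = np.concatenate(
        [np.repeat(fbank[:1], left, axis=0), fbank, np.repeat(fbank[-1:], right, axis=0)],
        axis=0,
    )
    # Each output row is m frames flattened into one vector.
    return np.stack([padded[i * n : i * n + m].reshape(-1) for i in range(t_out)])
```

With n=1 the output keeps one row per input frame, so the 10 ms frame rate is unchanged.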

Pipeline

waveform → [Preprocessor fp32/CPU] → features [1,T,400]
        → [FSMN fp16/ANE] → scores [1,T,248]
        → host: silence_prob = softmax(scores)[:, sil_pdf_ids].sum()  (sil_pdf_ids=[0])
        → state machine (thresholds in vad_config) → speech segments [start_ms, end_ms]
  • Frame rate: 10 ms (LFR n=1, no downsampling).
  • The segment decision logic (FunASR FsmnVADStreaming) runs on the host: silence/speech hysteresis with max_end_silence_time (800 ms), max_start_silence_time (3000 ms), max_single_segment_time (60 s), sil_to_speech_time_thres (150 ms). See vad_config.json.
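
The host-side steps above can be sketched in NumPy. This is a simplified illustration, not FunASR's FsmnVADStreaming: it uses only sil_to_speech_time_thres and max_end_silence_time, leaves out max_start_silence_time and max_single_segment_time, and decides speech per frame with a plain probability cutoff. The names `silence_prob`, `segment`, and `speech_thres` are hypothetical.

```python
import numpy as np

FRAME_MS = 10  # frame shift stated above

def silence_prob(scores, sil_pdf_ids=(0,)):
    """scores: [1, T, 248] scorer outputs -> per-frame silence probability [T]."""
    s = scores[0].astype(np.float64)
    s = s - s.max(axis=-1, keepdims=True)     # stabilise the softmax
    p = np.exp(s)
    p /= p.sum(axis=-1, keepdims=True)
    return p[:, list(sil_pdf_ids)].sum(axis=-1)

def segment(sil_prob, speech_thres=0.5, sil_to_speech_ms=150, max_end_silence_ms=800):
    """Two-state hysteresis over frames; returns [[start_ms, end_ms], ...].

    speech_thres is an assumed cutoff on (1 - silence probability).
    """
    segments, in_speech, run, start = [], False, 0, 0
    for t, p in enumerate(sil_prob):
        frame_is_speech = (1.0 - p) >= speech_thres
        if not in_speech:
            # Count consecutive speech frames before opening a segment.
            run = run + 1 if frame_is_speech else 0
            if run * FRAME_MS >= sil_to_speech_ms:
                in_speech, start, run = True, t - run + 1, 0
        else:
            # Count consecutive silence frames before closing the segment.
            run = 0 if frame_is_speech else run + 1
            if run * FRAME_MS >= max_end_silence_ms:
                segments.append([start * FRAME_MS, (t - run + 1) * FRAME_MS])
                in_speech, run = False, 0
    if in_speech:  # audio ended mid-segment: close at the last speech frame
        segments.append([start * FRAME_MS, (len(sil_prob) - run) * FRAME_MS])
    return segments
```

Usage would be `segment(silence_prob(scores))`, with the real thresholds read from vad_config.json rather than the defaults shown here.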

Benchmark: fidelity vs FunASR (FLEURS zh, n=50)

| Metric | Value |
|---|---|
| Frame F1 | 97.4% (P 100.0% / R 94.8%) |
| Median RTFx | 1209x |
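
For reference, frame-level precision/recall/F1 and RTFx can be computed as in the sketch below. It assumes FunASR's per-frame speech labels serve as the reference and that RTFx is audio duration divided by processing time. The function names are hypothetical, and how the reported figures were aggregated across files is not stated here.

```python
import numpy as np

def frame_prf(ref, hyp):
    """Precision, recall, F1 over per-frame speech labels (True = speech)."""
    ref = np.asarray(ref, dtype=bool)
    hyp = np.asarray(hyp, dtype=bool)
    tp = int(np.sum(ref & hyp))     # speech in both
    fp = int(np.sum(~ref & hyp))    # speech only in the hypothesis
    fn = int(np.sum(ref & ~hyp))    # speech only in the reference
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def rtfx(audio_seconds, wall_seconds):
    """Inverse real-time factor: seconds of audio processed per second of compute."""
    return audio_seconds / wall_seconds
```

A precision of 100% with lower recall means every frame the CoreML pipeline marks as speech is also speech in the reference, while some reference speech frames are missed.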

Parity: preprocessor matches WavFrontendOnline (max|Δ| ≈ 3e-5); FSMN scorer max|Δ| = 0.0016. Boundaries match FunASR within ~50 ms.

License

Weights derive from FunASR's FSMN-VAD; upstream license applies. Format conversion only.
