FSMN-VAD - CoreML (Apple Neural Engine)
CoreML conversion of FunASR's FSMN-VAD (~5.2M params), for on-device voice activity detection on Apple Silicon. Upstream: iic/speech_fsmn_vad_zh-cn-16k-common-pytorch.
Files
| File | Precision | Compute unit | Role |
|---|---|---|---|
| FsmnVadPreprocessor.mlmodelc | FP32 | CPU | waveform → 400-d features (fbank80 + LFR m=5,n=1 + CMVN) |
| FsmnVad.mlmodelc | FP16 | ANE | FSMN scorer → per-frame scores [1, T, 248] |
| vad_config.json | n/a | n/a | decision params (sil_pdf_ids, thresholds) |
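The 400-d feature size follows from stacking m=5 consecutive 80-d fbank frames with a step of n=1. Below is a minimal sketch of that low-frame-rate (LFR) stacking in plain Python. The function name `apply_lfr` and the edge-padding convention (repeat the first frame on the left, the last frame on the right) are assumptions for illustration; the padding used by the shipped preprocessor may differ.

```python
def apply_lfr(frames, m=5, n=1):
    """Stack m consecutive frames, advancing n frames per output row.

    frames: list of T feature vectors (each a list of floats).
    Assumed padding: (m - 1) // 2 copies of the first frame on the left,
    copies of the last frame on the right so every window is complete.
    """
    if not frames:
        return []
    left = (m - 1) // 2
    t_out = -(-len(frames) // n)  # ceil(T / n)
    padded = [frames[0]] * left + list(frames)
    needed = (t_out - 1) * n + m  # length required by the last window
    padded += [frames[-1]] * max(0, needed - len(padded))
    out = []
    for i in range(t_out):
        window = padded[i * n : i * n + m]
        # Concatenate the m frames into one long vector (80 * 5 = 400).
        out.append([x for frame in window for x in frame])
    return out


# Four dummy 80-d frames whose values equal their frame index.
fbank = [[float(t)] * 80 for t in range(4)]
stacked = apply_lfr(fbank)
```

With n=1 the output has as many rows as the input, so the stacking widens each frame without changing the frame rate.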
Pipeline
waveform → [Preprocessor fp32/CPU] → features [1,T,400]
→ [FSMN fp16/ANE] → scores [1,T,248]
→ host: silence_prob = softmax(scores)[:, sil_pdf_ids].sum() (sil_pdf_ids=[0])
→ state machine (thresholds in vad_config) → speech segments [start_ms, end_ms]
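The host-side silence probability step can be sketched as follows. This is an illustrative, dependency-free version that treats the scorer output as raw per-class scores for one batch item and computes one value per frame; the function name `silence_probs` is hypothetical, and a NumPy implementation would vectorize the same arithmetic.

```python
import math


def silence_probs(scores, sil_pdf_ids=(0,)):
    """Per-frame silence probability.

    scores: T rows of 248 raw scores (one batch item of the [1, T, 248] output).
    Returns a list of T floats: softmax mass assigned to the silence classes.
    """
    probs = []
    for row in scores:
        peak = max(row)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        probs.append(sum(exps[i] for i in sil_pdf_ids) / total)
    return probs


# Frame 0: uniform scores. Frame 1: class 0 (silence) strongly favoured.
demo_scores = [[0.0] * 248, [10.0] + [0.0] * 247]
demo_probs = silence_probs(demo_scores)
```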
- Frame rate: 10 ms (LFR n=1, no downsampling).
- The segment decision logic (FunASR FsmnVADStreaming) runs on the host: silence/speech hysteresis with max_end_silence_time (800 ms), max_start_silence_time (3000 ms), max_single_segment_time (60 s), sil_to_speech_time_thres (150 ms). See vad_config.json.
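To make the hysteresis idea concrete, here is a deliberately simplified sketch of turning per-frame silence probabilities into segments at a 10 ms frame rate. It is not FunASR's FsmnVADStreaming logic: the function name, the `speech_threshold` parameter, and the exact open/close rules are assumptions for illustration, and max_start_silence_time is left out entirely.

```python
FRAME_MS = 10  # one frame per 10 ms


def segments_from_silence_probs(
    sil_probs,
    speech_threshold=0.5,  # hypothetical: speech if 1 - sil_prob >= threshold
    sil_to_speech_time_thres=150,
    max_end_silence_time=800,
    max_single_segment_time=60_000,
):
    """Return [[start_ms, end_ms], ...] using simple two-state hysteresis."""
    segments = []
    in_speech = False
    start_ms = 0
    speech_run = 0   # consecutive speech frames while no segment is open
    silence_run = 0  # consecutive silence frames while a segment is open
    for t, p_sil in enumerate(sil_probs):
        is_speech = (1.0 - p_sil) >= speech_threshold
        now_ms = (t + 1) * FRAME_MS  # end time of frame t
        if not in_speech:
            speech_run = speech_run + 1 if is_speech else 0
            if speech_run * FRAME_MS >= sil_to_speech_time_thres:
                # Open a segment, backdated to the first speech frame of the run.
                in_speech = True
                start_ms = now_ms - speech_run * FRAME_MS
                silence_run = 0
        else:
            silence_run = 0 if is_speech else silence_run + 1
            if silence_run * FRAME_MS >= max_end_silence_time:
                # Close at the last speech frame, not at the end of the silence.
                segments.append([start_ms, now_ms - silence_run * FRAME_MS])
                in_speech, speech_run = False, 0
            elif now_ms - start_ms >= max_single_segment_time:
                # Force a split when the segment grows too long.
                segments.append([start_ms, now_ms])
                in_speech, speech_run = False, 0
    if in_speech:  # flush a segment still open at end of input
        total_ms = len(sil_probs) * FRAME_MS
        segments.append([start_ms, total_ms - silence_run * FRAME_MS])
    return segments


# 0.5 s silence, 1.0 s speech, 1.0 s silence.
demo = [1.0] * 50 + [0.0] * 100 + [1.0] * 100
demo_segments = segments_from_silence_probs(demo)
```

The two thresholds are asymmetric on purpose: a short 150 ms run is enough to open a segment, while a much longer 800 ms pause is required to close it, so brief pauses inside an utterance do not split it.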
Benchmark: fidelity vs FunASR (FLEURS zh, n=50)
| Metric | Value |
|---|---|
| Frame F1 | 97.4% (P 100.0% / R 94.8%) |
| Median RTFx | 1209x |
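For readers who want to reproduce metrics of this kind, a minimal sketch follows. It assumes the FunASR output is the reference labelling and the CoreML output is the prediction, with one 0/1 speech label per 10 ms frame, and it assumes RTFx means audio duration divided by processing time. The evaluation script behind the table is not part of this card, so both function names and both definitions are assumptions.

```python
def frame_prf(reference, predicted):
    """Frame-level precision, recall and F1 for 0/1 speech labels."""
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r and p)
    fp = sum(1 for r, p in pairs if not r and p)
    fn = sum(1 for r, p in pairs if r and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def rtfx(audio_seconds, processing_seconds):
    """Inverse real-time factor: seconds of audio handled per second of compute."""
    return audio_seconds / processing_seconds


# The prediction misses one speech frame and adds no false alarms.
ref_labels = [1, 1, 1, 1, 0, 0]
pred_labels = [1, 1, 1, 0, 0, 0]
precision, recall, f1 = frame_prf(ref_labels, pred_labels)
```

A precision of 100% with lower recall, as in the table, means every frame the converted model marks as speech is also speech in the reference, while some reference speech frames are missed.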
Parity: preprocessor matches WavFrontendOnline to max|Δ| ≈ 3e-5; FSMN scorer max|Δ| 0.0016. Boundaries match FunASR within ~50 ms.
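The max|Δ| figures are maximum element-wise absolute differences between the reference output and the converted model's output. A small sketch of that check over nested lists is shown below; the helper name `max_abs_diff` is hypothetical, and the values in the usage lines are made up purely to exercise the function.

```python
def max_abs_diff(a, b):
    """Largest element-wise |a - b| over two equally shaped nested lists."""
    if isinstance(a, (list, tuple)):
        if len(a) != len(b):
            raise ValueError("shape mismatch")
        return max((max_abs_diff(x, y) for x, y in zip(a, b)), default=0.0)
    return abs(a - b)


# Toy tensors, not real model outputs.
reference_out = [[1.0, 2.0], [3.0, 4.0]]
converted_out = [[1.0, 2.5], [3.0, 3.75]]
delta = max_abs_diff(reference_out, converted_out)
```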
License
Weights derive from FunASR's FSMN-VAD; upstream license applies. Format conversion only.