SenseVoiceSmall: CoreML (Apple Neural Engine)

CoreML conversion of FunAudioLLM/SenseVoiceSmall for on-device inference on Apple Silicon, intended for FluidInference/FluidAudio (tracks issues #645 / #646).

SenseVoiceSmall is a non-autoregressive multilingual ASR model (~234M params, SANM encoder + single CTC head) covering 50+ languages, with emotion and audio-event tags. One forward pass yields all output tokens.

Files (3-stage pipeline)

| File | Precision | Compute unit | Size | Role |
|---|---|---|---|---|
| SenseVoicePreprocessor.mlmodelc | FLOAT32 | CPU | 3 MB | front-end: waveform → 560-d LFR features |
| SenseVoiceSmall.mlmodelc | FLOAT16 | CPU_AND_NE (ANE) | 447 MB | default encoder+CTC |
| SenseVoiceSmall_int8.mlmodelc | INT8 (weights) | CPU_AND_NE (ANE) | 225 MB | ~half size, accuracy-neutral |
| SenseVoiceSmall_fp32.mlmodelc | FLOAT32 | any | 897 MB | encoder fallback (non-ANE) |
| vocab.json | - | - | - | 25055 SentencePiece tokens (array form) |

int8 is post-training weight quantization (linear_symmetric) and is accuracy-neutral vs fp16 on the full canonical sets: LibriSpeech test-clean WER 3.22→3.25% (2,620 utterances, Δ +0.03 pp) and AISHELL-1 test CER 3.09→3.09% (7,176 utterances, Δ 0.00 pp), with 0 NaN on ANE and peak RAM 0.54→0.32 GB. Pick it for ~half the on-disk/memory footprint.

Pipeline: waveform → [Preprocessor, fp32/CPU] → features → [encoder+CTC, fp16/ANE] → logits → host greedy-CTC decode.

⚠️ Compute-unit requirement. The FLOAT16 encoder is numerically correct on the Neural Engine but produces NaN on the CPU/GPU fp16 path. Load it with MLModelConfiguration.computeUnits = .cpuAndNeuralEngine. On hardware without an ANE (or under ANE fallback), use SenseVoiceSmall_fp32. The preprocessor must run in fp32 (its power-spectrum/log values exceed the fp16 range).
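
The selection rule above can be sketched as a small helper. This is only an illustration: `choose_encoder`, `EncoderChoice`, and the `has_ane` flag are hypothetical names, detecting the Neural Engine is left to the caller, and the compute-unit strings are assumed to mirror the member names of coremltools' `ComputeUnit` enum.

```python
from typing import NamedTuple


class EncoderChoice(NamedTuple):
    filename: str
    compute_units: str  # assumed to match a coremltools ComputeUnit member name


def choose_encoder(has_ane: bool, prefer_small: bool = False) -> EncoderChoice:
    """Pick an encoder file according to the compute-unit requirement."""
    if not has_ane:
        # The fp16 (and ANE-targeted int8) encoders are only listed for
        # CPU_AND_NE; off the Neural Engine the card points to fp32.
        # The files table lists "any" for fp32; ALL is one such setting.
        return EncoderChoice("SenseVoiceSmall_fp32.mlmodelc", "ALL")
    if prefer_small:
        # Roughly half the footprint, accuracy-neutral per the card.
        return EncoderChoice("SenseVoiceSmall_int8.mlmodelc", "CPU_AND_NE")
    return EncoderChoice("SenseVoiceSmall.mlmodelc", "CPU_AND_NE")


if __name__ == "__main__":
    print(choose_encoder(has_ane=True))
    print(choose_encoder(has_ane=False))
```

Keeping the choice in one function makes it harder to accidentally load the fp16 encoder on a CPU/GPU path, which is the failure mode the warning describes.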

I/O

SenseVoicePreprocessor. In: waveform [1, N] fp32 (16 kHz, scaled ×32768 like kaldi; flexible length). Out: features [1, T, 560] fp32.
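
A minimal sketch of the input scaling, assuming the audio has already been decoded to 16 kHz mono floats in the range -1.0 to 1.0 (a common decoder output, but an assumption here). The function name `to_preprocessor_input` is hypothetical.

```python
KALDI_SCALE = 32768.0  # the x32768 scaling named in the I/O description


def to_preprocessor_input(samples):
    """Scale normalized samples and wrap them in a [1, N] nested list."""
    scaled = [float(s) * KALDI_SCALE for s in samples]
    return [scaled]  # leading batch dimension of 1


if __name__ == "__main__":
    waveform = to_preprocessor_input([0.0, 0.5, -1.0])
    print(len(waveform), len(waveform[0]))
```

If the audio is read as raw 16-bit integers instead, the values are already in this range and only need a cast to fp32.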

SenseVoiceSmall (encoder+CTC):

| name | shape | dtype | notes |
|---|---|---|---|
| speech | [1, T, 560] | fp32 | preprocessor output; T ∈ enumerated buckets [128, 256, 512, 1024, 1800] (pad up) |
| speech_lengths | [1] | int32 | valid frame count (before padding) |
| language | [1] | int32 | embed index; 0 = auto |
| textnorm | [1] | int32 | 15 = no inverse text-norm (woitn), 14 = withitn |

Output: ctc_logits [1, T+4, 25055]. The 4 leading positions are the language/emotion/event/itn query tokens; the rest are the transcript.
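
The position layout can be illustrated with a short sketch that works on per-position argmax ids. It assumes that positions past `valid_frames + 4` come from bucket padding and should be ignored, which mirrors how FunASR truncates to the encoder output length; this card does not state that explicitly. `split_positions` is a hypothetical helper, and the token ids in the example are made up.

```python
NUM_QUERY_POSITIONS = 4  # language, emotion, event, itn


def split_positions(position_ids, valid_frames):
    """Split argmax ids into (query-token ids, transcript-position ids)."""
    # Assumption: everything after valid_frames + 4 is padding output.
    valid = position_ids[: valid_frames + NUM_QUERY_POSITIONS]
    return valid[:NUM_QUERY_POSITIONS], valid[NUM_QUERY_POSITIONS:]


if __name__ == "__main__":
    # 4 query positions, 4 valid frames, 2 padded positions (illustrative ids).
    ids = [24884, 25001, 25004, 25017, 5, 5, 0, 7, 0, 0]
    tags, transcript = split_positions(ids, valid_frames=4)
    print(tags, transcript)
```

For a 128-frame bucket the output therefore has 132 positions, of which only the first `speech_lengths + 4` are meaningful under this assumption.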

Host pre/post-processing

Pre: handled by SenseVoicePreprocessor (kaldi fbank80 → LFR m=7,n=6 → CMVN, matching FunASR WavFrontend to max|Δ|≈2e-5). Pad its output up to the smallest encoder bucket ≥ T.
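
Bucket selection and padding might look like the sketch below. Two points are assumptions rather than statements from this card: padding frames are filled with zeros, and inputs longer than the largest bucket raise an error (the card does not say how longer audio should be handled). `pick_bucket` and `pad_features` are hypothetical names.

```python
ENCODER_BUCKETS = (128, 256, 512, 1024, 1800)
FEATURE_DIM = 560


def pick_bucket(num_frames):
    """Return the smallest enumerated bucket that is >= num_frames."""
    for bucket in ENCODER_BUCKETS:
        if num_frames <= bucket:
            return bucket
    raise ValueError(
        f"{num_frames} frames exceeds the largest bucket {ENCODER_BUCKETS[-1]}"
    )


def pad_features(features):
    """Pad a list of T feature rows up to its bucket.

    Returns (padded_rows, valid_frame_count); the count is what goes
    into speech_lengths.
    """
    num_frames = len(features)
    bucket = pick_bucket(num_frames)
    # Assumption: zero-filled rows are an acceptable padding value.
    padding = [[0.0] * FEATURE_DIM for _ in range(bucket - num_frames)]
    return features + padding, num_frames


if __name__ == "__main__":
    rows = [[1.0] * FEATURE_DIM for _ in range(93)]
    padded, valid = pad_features(rows)
    print(len(padded), valid)
```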

Post (decode): greedy CTC over ctc_logits → collapse repeats → drop blank (id 0) → SentencePiece detokenize → strip <|...|> tags for the clean transcript. A reference Python implementation is in the repo's decode.py.
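
The decode steps above can be sketched in plain Python. This is not the repo's decode.py: the function names are hypothetical, `vocab` is assumed to be the vocab.json array with the list index as the token id, and detokenization is approximated by joining pieces and turning the SentencePiece word-boundary marker (U+2581) into a space. A real SentencePiece decoder also handles cases such as byte-fallback pieces that this sketch ignores.

```python
import re

BLANK_ID = 0
TAG_PATTERN = re.compile(r"<\|[^|]*\|>")


def argmax(row):
    """Index of the largest score in one logits row."""
    return max(range(len(row)), key=row.__getitem__)


def greedy_ctc_ids(logits):
    """Argmax per position, collapse consecutive repeats, drop blanks."""
    ids = []
    previous = None
    for row in logits:
        current = argmax(row)
        if current != previous and current != BLANK_ID:
            ids.append(current)
        previous = current  # a blank between two equal ids keeps both
    return ids


def detokenize(ids, vocab):
    """Approximate SentencePiece detokenization (see assumptions above)."""
    text = "".join(vocab[i] for i in ids)
    return text.replace("\u2581", " ").strip()


def strip_tags(text):
    """Remove <|...|> tags to get the clean transcript."""
    return TAG_PATTERN.sub("", text).strip()


if __name__ == "__main__":
    toy_vocab = ["<blank>", "<|en|>", "\u2581hello", "\u2581world"]
    path = [1, 2, 2, 0, 2, 3, 3]
    toy_logits = [[1.0 if j == i else 0.0 for j in range(4)] for i in path]
    ids = greedy_ctc_ids(toy_logits)
    print(strip_tags(detokenize(ids, toy_vocab)))
```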

language/textnorm are embed indices, mapped on the host:

lid_int_dict      = {24884:3, 24885:4, 24888:7, 24892:11, 24896:12, 24992:13}  # <|zh|> etc -> embed idx
textnorm_int_dict = {25016:14, 25017:15}
# language not in dict -> 0 (auto)
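
One way to apply these tables on the host is sketched below. The helper names are hypothetical. The fallback to 0 (auto) for an unknown language follows the comment above; raising on an unknown textnorm id is a choice made for this sketch, since the card gives no default for it.

```python
LID_INT_DICT = {24884: 3, 24885: 4, 24888: 7, 24892: 11, 24896: 12, 24992: 13}
TEXTNORM_INT_DICT = {25016: 14, 25017: 15}


def language_index(token_id=None):
    """Map a language token id to its embed index; unknown or None -> 0 (auto)."""
    return LID_INT_DICT.get(token_id, 0)


def textnorm_index(token_id):
    """Map a textnorm token id to its embed index (14 = withitn, 15 = woitn)."""
    if token_id not in TEXTNORM_INT_DICT:
        # Assumption: fail loudly rather than guess a default.
        raise KeyError(f"unknown textnorm token id: {token_id}")
    return TEXTNORM_INT_DICT[token_id]


if __name__ == "__main__":
    print(language_index(24884), language_index(None), textnorm_index(25017))
```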

Verification & benchmarks

Conversion = PyTorch (FunASR) → torch.jit.trace → coremltools (FLOAT16, EnumeratedShapes, iOS17). Measured on this machine (M-series), FunASR 1.3.9 / coremltools 8.3.

  • End-to-end correctness: on the cached zh sample, the CoreML(ANE) → greedy-CTC pipeline reproduces FunASR am.generate exactly: <|zh|><|NEUTRAL|><|Speech|><|woitn|>欢迎大家来体验达摩院推出的语音识别模型

  • Parity (torch ↔ CoreML, ANE): CTC argmax token agreement 100% on real audio.

  • LibriSpeech test-clean (canonical; matches the official chart): CoreML(ANE) 3.21% WER (torch 3.26%) on n=100, vs the published SenseVoice-Small ~3.1%. This confirms that the full pipeline (front-end + CoreML + decode) reproduces the paper. (For the full 2620-utterance split number, see the repo README.)

  • FLEURS WER (CoreML ANE vs torch), 100 samples/lang; conversion is accuracy-neutral:

    | lang | torch | CoreML (ANE) | Δ | RTFx |
    |---|---|---|---|---|
    | en_us (WER) | 9.52% | 9.52% | +0.00pp | 402 |
    | cmn_hans_cn (CER) | 9.60% | 9.57% | −0.03pp | 372 |

    FLEURS is a harder/different read-speech set than LibriSpeech/AISHELL, so its absolute numbers are not comparable to the official benchmark chart; it's used here only for cross-language CoreML↔torch parity.

  • RTFx (5.55 s clip, by bucket, ANE): 128→524, 256→274, 512→97, 1024→36, 1800→14.5. (M-series; iPhone ANE not yet measured.)
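
To read the RTFx figures as latencies, the sketch below assumes the usual definition RTFx = audio duration / processing time (the card does not define it). The helper name `latency_ms` is hypothetical.

```python
CLIP_SECONDS = 5.55
RTFX_BY_BUCKET = {128: 524, 256: 274, 512: 97, 1024: 36, 1800: 14.5}


def latency_ms(audio_seconds, rtfx):
    """Processing time in milliseconds implied by an RTFx value."""
    return audio_seconds / rtfx * 1000.0


if __name__ == "__main__":
    for bucket, rtfx in RTFX_BY_BUCKET.items():
        print(f"bucket {bucket}: ~{latency_ms(CLIP_SECONDS, rtfx):.1f} ms")
```

Under that definition the same 5.55 s clip costs roughly 10.6 ms in the 128 bucket and about 382.8 ms in the 1800 bucket, which is why padding to the smallest bucket that fits matters.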

License & attribution

Weights derive from FunAudioLLM/SenseVoiceSmall; the upstream model license applies. This repo only contains a format conversion (no retraining). See the SenseVoice and FunASR projects.
