OmniVoice — Singing + Emotion Finetune

A finetune of k2-fsa/OmniVoice that adds:

  • [singing] tag — sung speech / nursery-style melodic vocals
  • Emotion tags — [happy], [sad], [angry], [excited], [calm], [nervous], [whisper]
  • Combined tags — e.g. [singing] [happy] ... or [singing] [sad] ...

Original OmniVoice capabilities (multilingual zero-shot TTS, voice cloning, voice design, 600+ languages) are preserved — the base speech head was protected during finetuning with a continuity mix of plain speech and singing.

Drop-in replacement

This checkpoint is fully compatible with the upstream k2-fsa/OmniVoice code — same architecture (Qwen3-0.6B LM + HiggsAudioV2 audio tokenizer at 24 kHz), same inference API. Replace the model id:

from omnivoice.models.omnivoice import OmniVoice

model = OmniVoice.from_pretrained("ModelsLab/omnivoice-singing").to("cuda").eval()

# Normal speech (unchanged behavior)
audios = model.generate(
    text="The quick brown fox jumps over the lazy dog.",
    language="English",
)

# Singing
audios = model.generate(
    text="[singing] Twinkle twinkle little star, how I wonder what you are.",
    language="English",
)

# Emotional speech
audios = model.generate(
    text="[happy] I just got the best news of my entire year!",
    language="English",
)

# Combined
audios = model.generate(
    text="[singing] [sad] Quiet rain falls on the stone, memories of days now gone.",
    language="English",
)

import soundfile as sf
sf.write("out.wav", audios[0], model.sampling_rate)

CLI works the same way:

omnivoice-infer --model ModelsLab/omnivoice-singing \
    --text "[happy] Hello there, how wonderful to see you today!" \
    --language English \
    --output out.wav

Supported tags

Tag         Source data                                      Strength
[singing]   GTSinger English (6,755 clips, ~8 h)             strong
[happy]     CREMA-D + RAVDESS + Expresso (~2,900 clips)      strong
[sad]       CREMA-D + RAVDESS + Expresso (~2,900 clips)      strong
[angry]     CREMA-D + RAVDESS (~1,500 clips)                 strong
[nervous]   CREMA-D fear + RAVDESS fearful (~1,400 clips)    strong
[whisper]   Expresso whisper (~1,500 clips)                  strong
[calm]      RAVDESS calm (~190 clips)                        weak — limited data
[excited]   RAVDESS surprised (~190 clips)                   weak — limited data
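Since the tags are plain bracketed prefixes on the transcript, a small helper can keep their usage consistent across an application. The helper below (`tag_text` and `SUPPORTED_TAGS` are hypothetical conveniences, not part of the upstream OmniVoice API) mirrors the tag table above:

```python
# Hypothetical convenience helper (not part of the upstream OmniVoice API):
# prepends validated control tags to a transcript, mirroring the tag table above.
SUPPORTED_TAGS = {
    "singing", "happy", "sad", "angry", "excited", "calm", "nervous", "whisper",
}

def tag_text(text: str, *tags: str) -> str:
    """Prefix `text` with bracketed tags, e.g. tag_text(lyrics, "singing", "sad")."""
    for tag in tags:
        if tag not in SUPPORTED_TAGS:
            raise ValueError(f"unsupported tag: {tag!r}")
    prefix = "".join(f"[{t}] " for t in tags)
    return prefix + text

print(tag_text("Twinkle twinkle little star.", "singing", "sad"))
# [singing] [sad] Twinkle twinkle little star.
```

The resulting string is passed to `model.generate(text=...)` exactly as in the examples above.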

A guidance scale of 3.0 (up from the default of 2.0) is recommended to make tag behavior more pronounced:

audios = model.generate(
    text="[happy] Welcome!",
    language="English",
    guidance_scale=3.0,
)

What's preserved from the base

  • Multilingual TTS (English, Chinese, Japanese, Korean, Spanish, French, German, Italian, Russian, Hindi, Gujarati, etc.)
  • Voice cloning from reference audio (ref_audio / ref_text args)
  • Voice design via instruct parameter (pitch / gender / age / accent attributes)
  • Fine-grained pronunciation control (pinyin / CMU phoneme overrides)
  • Speed and duration control (speed / duration args)
  • Built-in non-verbal symbols ([laughter], [sigh], etc.)

Training

Two-stage finetune from k2-fsa/OmniVoice:

Stage 1 — Singing (2500 steps):

  • GTSinger English (6.8k clips, tagged [singing] {lyrics})
  • LibriTTS-R dev+test clean (10k clips, plain text — speech preservation)
  • LR 3e-5 cosine, bf16, 2 GPUs, batch_tokens=8192
  • Final eval loss: 4.74

Stage 2 — Emotion (2500 steps, forked from singing/checkpoint-2500):

  • CREMA-D + RAVDESS + Expresso read config (10.8k emotion clips)
  • 1.5k singing + 1.5k speech continuity samples
  • LR 3e-5 cosine, bf16, 2 GPUs, batch_tokens=8192
  • Best eval loss: 4.72 (step 750); final eval loss: 4.88 (step 2500, the published checkpoint)

This published checkpoint is emotion step 2500: despite the higher eval loss, it subjectively produces the cleanest emotional tag behavior while preserving speech and singing quality.
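The Stage 2 recipe above implies that raw emotion labels from the source corpora were mapped onto the tag vocabulary and prepended to transcripts. The sketch below illustrates one plausible mapping under the pairings stated in the tag table (fear/fearful → `[nervous]`, surprised → `[excited]`); it is an illustration, not the actual training script:

```python
# Illustrative sketch (assumption, not the actual training pipeline): map raw
# emotion labels from CREMA-D/RAVDESS-style metadata to the finetune's tag
# vocabulary, producing "[tag] transcript" training texts.
LABEL_TO_TAG = {
    "happy": "happy", "sad": "sad", "angry": "angry",
    "fear": "nervous", "fearful": "nervous",   # CREMA-D fear / RAVDESS fearful
    "calm": "calm", "surprised": "excited",    # RAVDESS-only labels
    "whisper": "whisper",                      # Expresso whisper
}

def make_training_text(label: str, transcript: str) -> str:
    tag = LABEL_TO_TAG.get(label.lower())
    if tag is None:
        return transcript          # continuity sample: plain speech, no tag
    return f"[{tag}] {transcript}"

print(make_training_text("fearful", "Don't go in there."))
# [nervous] Don't go in there.
```

Untagged continuity samples (plain LibriTTS-R-style speech) fall through the `None` branch, matching the speech-preservation mix described above.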

Known limitations

  • [calm] and [excited] had only ~190 training samples each (each drawn from a single dataset), so their behavior is weaker than that of the other emotion tags.
  • Cross-language singing (sung Hindi, Gujarati, etc.) is extrapolation — works but quality varies.
  • Like the base model, output quality is bounded by the HiggsAudioV2 tokenizer (24 kHz, ~2 kbps, speech-domain tuned). Music / drum content is not supported by design.

License

Apache 2.0. Downstream users must also comply with the individual licenses of the training datasets:

  • GTSinger: CC BY-NC-SA 4.0 (research use)
  • CREMA-D: ODbL
  • RAVDESS: CC BY-NC-SA 4.0
  • Expresso: CC BY-NC 4.0
  • LibriTTS-R: CC BY 4.0

Acknowledgements

  • k2-fsa/OmniVoice — base model & training framework
  • HiggsAudioV2 — discrete audio tokenizer
  • Qwen team — Qwen3-0.6B backbone
  • Dataset authors: GTSinger, CREMA-D, RAVDESS, Expresso, LibriTTS-R teams