# OmniVoice — Singing + Emotion Finetune
A finetune of k2-fsa/OmniVoice that adds:

- `[singing]` tag — sung speech / nursery-style melodic vocals
- Emotion tags — `[happy]`, `[sad]`, `[angry]`, `[excited]`, `[calm]`, `[nervous]`, `[whisper]`
- Combined tags — e.g. `[singing] [happy] ...` or `[singing] [sad] ...`
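Since the tags are plain bracketed prefixes inside the prompt text, a small helper can strip and validate them before a prompt is sent to the model. This is a hypothetical utility (`split_tags` is not part of the OmniVoice API), shown only to make the tag format concrete:

```python
import re

# Tags supported by this finetune (from the list above).
KNOWN_TAGS = {"singing", "happy", "sad", "angry", "excited", "calm", "nervous", "whisper"}

def split_tags(text: str):
    """Split leading [tag] markers from a prompt; returns (tags, remaining_text)."""
    tags = []
    rest = text.strip()
    while True:
        m = re.match(r"\[(\w+)\]\s*", rest)
        if m and m.group(1) in KNOWN_TAGS:
            tags.append(m.group(1))
            rest = rest[m.end():]
        else:
            break
    return tags, rest
```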
Original OmniVoice capabilities (multilingual zero-shot TTS, voice cloning, voice design, 600+ languages) are preserved — the base speech head was protected during finetuning with a continuity mix of plain speech and singing.
## Drop-in replacement
This checkpoint is fully compatible with the upstream k2-fsa/OmniVoice code — same architecture (Qwen3-0.6B LM + HiggsAudioV2 audio tokenizer at 24 kHz), same inference API. Replace the model id:
```python
from omnivoice.models.omnivoice import OmniVoice
import soundfile as sf

model = OmniVoice.from_pretrained("ModelsLab/omnivoice-singing").to("cuda").eval()

# Normal speech (unchanged behavior)
audios = model.generate(
    text="The quick brown fox jumps over the lazy dog.",
    language="English",
)

# Singing
audios = model.generate(
    text="[singing] Twinkle twinkle little star, how I wonder what you are.",
    language="English",
)

# Emotional speech
audios = model.generate(
    text="[happy] I just got the best news of my entire year!",
    language="English",
)

# Combined
audios = model.generate(
    text="[singing] [sad] Quiet rain falls on the stone, memories of days now gone.",
    language="English",
)

sf.write("out.wav", audios[0], model.sampling_rate)
```
The CLI works the same way:

```bash
omnivoice-infer --model ModelsLab/omnivoice-singing \
    --text "[happy] Hello there, how wonderful to see you today!" \
    --language English \
    --output out.wav
```
## Supported tags
| Tag | Source data | Strength |
|---|---|---|
| `[singing]` | GTSinger English (6,755 clips, ~8 h) | strong |
| `[happy]` | CREMA-D + RAVDESS + Expresso (~2,900 clips) | strong |
| `[sad]` | CREMA-D + RAVDESS + Expresso (~2,900 clips) | strong |
| `[angry]` | CREMA-D + RAVDESS (~1,500 clips) | strong |
| `[nervous]` | CREMA-D fear + RAVDESS fearful (~1,400 clips) | strong |
| `[whisper]` | Expresso whisper (~1,500 clips) | strong |
| `[calm]` | RAVDESS calm (~190 clips) | weak — limited data |
| `[excited]` | RAVDESS surprised (~190 clips) | weak — limited data |
A guidance scale of 3.0 (up from the default 2.0) is recommended to make tag behavior more pronounced:

```python
audios = model.generate(
    text="[happy] Welcome!",
    language="English",
    guidance_scale=3.0,
)
```
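For intuition, `guidance_scale` in models of this kind typically follows the standard classifier-free guidance mixing rule, interpolating past the unconditional logits toward the conditional ones; whether OmniVoice implements exactly this form is an assumption here:

```python
# Sketch of classifier-free guidance as commonly implemented in LM-based TTS.
# scale=1.0 leaves the conditional logits unchanged; larger scales push further
# along the (cond - uncond) direction, amplifying the effect of the tag/prompt.
def apply_guidance(cond, uncond, scale):
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]
```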
## What's preserved from the base
- Multilingual TTS (English, Chinese, Japanese, Korean, Spanish, French, German, Italian, Russian, Hindi, Gujarati, etc.)
- Voice cloning from reference audio (`ref_audio` / `ref_text` args)
- Voice design via the `instruct` parameter (pitch / gender / age / accent attributes)
- Fine-grained pronunciation control (pinyin / CMU phoneme overrides)
- Speed and duration control (`speed` / `duration` args)
- Built-in non-verbal symbols (`[laughter]`, `[sigh]`, etc.)
## Training
Two-stage finetune from k2-fsa/OmniVoice:
Stage 1 — Singing (2500 steps):
- GTSinger English (6.8k clips, tagged `[singing] {lyrics}`)
- LibriTTS-R dev+test clean (10k clips, plain text — speech preservation)
- LR 3e-5 cosine, bf16, 2 GPUs, batch_tokens=8192
- Final eval loss: 4.74
Stage 2 — Emotion (2500 steps, forked from singing/checkpoint-2500):
- CREMA-D + RAVDESS + Expresso read config (10.8k emotion clips)
- 1.5k singing + 1.5k speech continuity samples
- LR 3e-5 cosine, bf16, 2 GPUs, batch_tokens=8192
- Best eval loss: 4.72 (step 750); final eval loss: 4.88 (step 2500)

This published checkpoint is the final emotion step 2500, which subjectively produces the cleanest emotional tag behavior while preserving speech/singing quality.
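For reference, the cosine learning-rate schedule quoted above (peak 3e-5 over 2500 steps) can be sketched as follows; warmup settings were not stated, so this sketch assumes none:

```python
import math

def cosine_lr(step, total_steps=2500, peak_lr=3e-5, min_lr=0.0):
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```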
## Known limitations
- `[calm]` and `[excited]` had only ~190 training samples each (only one dataset contributed), so their behavior is weaker than the other emotion tags.
- Cross-language singing (sung Hindi, Gujarati, etc.) is extrapolation — it works, but quality varies.
- Like the base model, output quality is bounded by the HiggsAudioV2 tokenizer (24 kHz, ~2 kbps, speech-domain tuned). Music / drum content is not supported by design.
## License
Apache 2.0. Downstream users must also comply with the individual licenses of the training datasets:
- GTSinger: CC BY-NC-SA 4.0 (research use)
- CREMA-D: ODbL
- RAVDESS: CC BY-NC-SA 4.0
- Expresso: CC BY-NC 4.0
- LibriTTS-R: CC BY 4.0
## Acknowledgements
- k2-fsa/OmniVoice — base model & training framework
- HiggsAudioV2 — discrete audio tokenizer
- Qwen team — Qwen3-0.6B backbone
- Dataset authors: GTSinger, CREMA-D, RAVDESS, Expresso, LibriTTS-R teams