OmniVoice — Singing + Emotion Finetune

A finetune of k2-fsa/OmniVoice that adds:

  • [singing] tag — sung speech / nursery-style melodic vocals
  • Emotion tags — [happy], [sad], [angry], [excited], [calm], [nervous], [whisper]
  • Combined tags — e.g. [singing] [happy] ... or [singing] [sad] ...

Original OmniVoice capabilities (multilingual zero-shot TTS, voice cloning, voice design, 600+ languages) are preserved — the base speech head was protected during finetuning with a continuity mix of plain speech and singing.

Drop-in replacement

This checkpoint is fully compatible with the upstream k2-fsa/OmniVoice code — same architecture (Qwen3-0.6B LM + HiggsAudioV2 audio tokenizer at 24 kHz), same inference API. Replace the model id:

from omnivoice.models.omnivoice import OmniVoice

model = OmniVoice.from_pretrained("ModelsLab/omnivoice-singing").to("cuda").eval()

# Normal speech (unchanged behavior)
audios = model.generate(
    text="The quick brown fox jumps over the lazy dog.",
    language="English",
)

# Singing
audios = model.generate(
    text="[singing] Twinkle twinkle little star, how I wonder what you are.",
    language="English",
)

# Emotional speech
audios = model.generate(
    text="[happy] I just got the best news of my entire year!",
    language="English",
)

# Combined
audios = model.generate(
    text="[singing] [sad] Quiet rain falls on the stone, memories of days now gone.",
    language="English",
)

import soundfile as sf
sf.write("out.wav", audios[0], model.sampling_rate)

CLI works the same way:

omnivoice-infer --model ModelsLab/omnivoice-singing \
    --text "[happy] Hello there, how wonderful to see you today!" \
    --language English \
    --output out.wav

Supported tags

Tag         Source data                                      Strength
[singing]   GTSinger English (6,755 clips, ~8 h)             strong
[happy]     CREMA-D + RAVDESS + Expresso (~2,900 clips)      strong
[sad]       CREMA-D + RAVDESS + Expresso (~2,900 clips)      strong
[angry]     CREMA-D + RAVDESS (~1,500 clips)                 strong
[nervous]   CREMA-D fear + RAVDESS fearful (~1,400 clips)    strong
[whisper]   Expresso whisper (~1,500 clips)                  strong
[calm]      RAVDESS calm (~190 clips)                        weak — limited data
[excited]   RAVDESS surprised (~190 clips)                   weak — limited data
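Since the tags are plain bracketed prefixes on the transcript, a small helper can keep their usage consistent across an application. The helper below (`tag_text` and `SUPPORTED_TAGS` are hypothetical conveniences, not part of the upstream OmniVoice API) mirrors the tag table above:

```python
# Hypothetical convenience helper (not part of the upstream OmniVoice API):
# prepends validated control tags to a transcript, mirroring the tag table above.
SUPPORTED_TAGS = {
    "singing", "happy", "sad", "angry", "excited", "calm", "nervous", "whisper",
}

def tag_text(text: str, *tags: str) -> str:
    """Prefix `text` with bracketed tags, e.g. tag_text(lyrics, "singing", "sad")."""
    for tag in tags:
        if tag not in SUPPORTED_TAGS:
            raise ValueError(f"unsupported tag: {tag!r}")
    prefix = "".join(f"[{t}] " for t in tags)
    return prefix + text

print(tag_text("Twinkle twinkle little star.", "singing", "sad"))
# [singing] [sad] Twinkle twinkle little star.
```

The resulting string is passed to `model.generate(text=...)` exactly as in the examples above.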

A guidance scale of 3.0 (up from the default of 2.0) is recommended to make tag behavior more pronounced:

audios = model.generate(
    text="[happy] Welcome!",
    language="English",
    guidance_scale=3.0,
)

What's preserved from the base

  • Multilingual TTS (English, Chinese, Japanese, Korean, Spanish, French, German, Italian, Russian, Hindi, Gujarati, etc.)
  • Voice cloning from reference audio (ref_audio / ref_text args)
  • Voice design via instruct parameter (pitch / gender / age / accent attributes)
  • Fine-grained pronunciation control (pinyin / CMU phoneme overrides)
  • Speed and duration control (speed / duration args)
  • Built-in non-verbal symbols ([laughter], [sigh], etc.)

Training

Two-stage finetune from k2-fsa/OmniVoice:

Stage 1 — Singing (2500 steps):

  • GTSinger English (6.8k clips, tagged [singing] {lyrics})
  • LibriTTS-R dev+test clean (10k clips, plain text — speech preservation)
  • LR 3e-5 cosine, bf16, 2 GPUs, batch_tokens=8192
  • Final eval loss: 4.74

Stage 2 — Emotion (2500 steps, forked from singing/checkpoint-2500):

  • CREMA-D + RAVDESS + Expresso read config (10.8k emotion clips)
  • 1.5k singing + 1.5k speech continuity samples
  • LR 3e-5 cosine, bf16, 2 GPUs, batch_tokens=8192
  • Best eval loss: 4.72 (step 750); final eval loss: 4.88 (step 2500, the published checkpoint)

This published checkpoint is emotion step 2500: despite the higher eval loss, it subjectively produces the cleanest emotional tag behavior while preserving speech and singing quality.
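The Stage 2 recipe above implies that raw emotion labels from the source corpora were mapped onto the tag vocabulary and prepended to transcripts. The sketch below illustrates one plausible mapping under the pairings stated in the tag table (fear/fearful → `[nervous]`, surprised → `[excited]`); it is an illustration, not the actual training script:

```python
# Illustrative sketch (assumption, not the actual training pipeline): map raw
# emotion labels from CREMA-D/RAVDESS-style metadata to the finetune's tag
# vocabulary, producing "[tag] transcript" training texts.
LABEL_TO_TAG = {
    "happy": "happy", "sad": "sad", "angry": "angry",
    "fear": "nervous", "fearful": "nervous",   # CREMA-D fear / RAVDESS fearful
    "calm": "calm", "surprised": "excited",    # RAVDESS-only labels
    "whisper": "whisper",                      # Expresso whisper
}

def make_training_text(label: str, transcript: str) -> str:
    tag = LABEL_TO_TAG.get(label.lower())
    if tag is None:
        return transcript          # continuity sample: plain speech, no tag
    return f"[{tag}] {transcript}"

print(make_training_text("fearful", "Don't go in there."))
# [nervous] Don't go in there.
```

Untagged continuity samples (plain LibriTTS-R-style speech) fall through the `None` branch, matching the speech-preservation mix described above.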

Known limitations

  • [calm] and [excited] had only ~190 training samples each (each drawn from a single dataset), so their behavior is weaker than that of the other emotion tags.
  • Cross-language singing (sung Hindi, Gujarati, etc.) is extrapolation — works but quality varies.
  • Like the base model, output quality is bounded by the HiggsAudioV2 tokenizer (24 kHz, ~2 kbps, speech-domain tuned). Music / drum content is not supported by design.

License

Apache 2.0. Downstream users must also comply with the individual licenses of the training datasets:

  • GTSinger: CC BY-NC-SA 4.0 (research use)
  • CREMA-D: ODbL
  • RAVDESS: CC BY-NC-SA 4.0
  • Expresso: CC BY-NC 4.0
  • LibriTTS-R: CC BY 4.0

Acknowledgements

  • k2-fsa/OmniVoice — base model & training framework
  • HiggsAudioV2 — discrete audio tokenizer
  • Qwen team — Qwen3-0.6B backbone
  • Dataset authors: GTSinger, CREMA-D, RAVDESS, Expresso, LibriTTS-R teams