Abstract
Fish Audio S2 is an open-source text-to-speech system with multi-speaker capabilities, multi-turn generation, and instruction-following control through natural-language descriptions, built with a multi-stage training approach and released with a production-ready inference engine.
We introduce Fish Audio S2, an open-source text-to-speech system featuring multi-speaker and multi-turn generation and, most importantly, instruction-following control via natural-language descriptions. To scale training, we develop a multi-stage training recipe together with a staged data pipeline covering video captioning and speech captioning, voice-quality assessment, and reward modeling. To push the frontier of open-source TTS, we release our model weights, fine-tuning code, and an SGLang-based inference engine. The inference engine is production-ready for streaming, achieving a real-time factor (RTF) of 0.195 and a time-to-first-audio below 100 ms. Our code and weights are available on GitHub (https://github.com/fishaudio/fish-speech) and Hugging Face (https://huggingface.co/fishaudio/s2-pro). We highly encourage readers to visit https://fish.audio to try custom voices.
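To make the reported RTF concrete, the following is a minimal sketch that assumes the conventional definition of real-time factor (time spent synthesizing divided by the duration of the audio produced). The abstract does not state the exact measurement protocol, so the function names and the 10-second example clip are illustrative assumptions rather than part of the released code.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Conventional RTF: wall-clock synthesis time / duration of generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup_over_real_time(rtf: float) -> float:
    """How many seconds of audio are produced per second of compute."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


# Hypothetical example: a 10-second clip at the RTF reported in the abstract.
REPORTED_RTF = 0.195
clip_seconds = 10.0  # assumed clip length, for illustration only
synthesis_seconds = REPORTED_RTF * clip_seconds

print(f"synthesis time: {synthesis_seconds:.2f} s for {clip_seconds:.0f} s of audio")
print(f"speedup over real time: {speedup_over_real_time(REPORTED_RTF):.2f}x")
```

Under this definition, an RTF below 1 means audio is generated faster than it plays back, which is the precondition for streaming without stalls. Time-to-first-audio is a separate latency measure and is not captured by RTF.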