---
language:
- en
tags:
- audio-text-to-text
- speech-translation
- speech-understanding
- audio
- chat
license: apache-2.0
datasets:
- custom
metrics:
- wer
- bleu
- AIR-Bench
---
<div align="center">
<h1>
Soundwave: Less is More for Speech-Text Alignment in LLMs
</h1>
</div>

<p align="center">
<font size="3"><a href="https://github.com/FreedomIntelligence/Soundwave">🐈⬛ GitHub</a> | <a href="https://arxiv.org/abs/2502.12900">📃 Paper</a> | <a href="https://huggingface.co/spaces/puccho/Soundwave">📼 Online Demo</a></font>
</p>

## Model Description
Soundwave is a speech-to-text model that bridges the gap between speech and text. Trained on just 10k hours of data, it delivers strong performance in speech translation and on AIR-Bench speech tasks.

### Key Features
<div>
<ul>
<font size="3"><li>A speech-to-text model bridging the gap between speech and text</li></font>
<font size="3"><li>A data-efficient training strategy and distinctive architecture, requiring only 10k hours of data</li></font>
<font size="3"><li>Strong performance in speech translation and AIR-Bench speech tasks</li></font>
<font size="3"><li>Retains intelligence during conversations, making it well suited to interactive tasks</li></font>
</ul>
</div>
## Usage
Load the Soundwave model and run inference on your audio files as shown in the <a href="https://github.com/FreedomIntelligence/Soundwave">GitHub repository</a>.

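The exact model-loading and inference entry points are documented in the GitHub repository. As a minimal, self-contained sketch of the audio preparation step most speech-LLM pipelines share, the helper below downmixes a waveform to mono and linearly resamples it to 16 kHz float32. Note the assumptions: that Soundwave expects 16 kHz mono input is inferred from common practice for speech models, not confirmed by this card, and the function name `to_mono_16k` is ours.

```python
import numpy as np

def to_mono_16k(samples: np.ndarray, sr: int, target_sr: int = 16_000) -> np.ndarray:
    """Downmix to mono and linearly resample to target_sr; returns float32 in [-1, 1].

    samples: 1-D array (num_samples,) or 2-D array (num_samples, channels).
    sr: the waveform's original sample rate in Hz.
    """
    if samples.ndim == 2:                       # (num_samples, channels) -> mono
        samples = samples.mean(axis=1)
    samples = samples.astype(np.float32)
    if sr != target_sr:
        duration = samples.shape[0] / sr
        n_out = int(round(duration * target_sr))
        t_in = np.linspace(0.0, duration, num=samples.shape[0], endpoint=False)
        t_out = np.linspace(0.0, duration, num=n_out, endpoint=False)
        samples = np.interp(t_out, t_in, samples).astype(np.float32)
    peak = np.abs(samples).max()
    if peak > 1.0:                              # guard against clipping after resampling
        samples = samples / peak
    return samples
```

For a real audio file, you would first decode it (e.g. with `soundfile.read`) and pass the resulting array and sample rate through this helper before handing the audio to the model.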
# <span>📖 Citation</span>
```
@article{zhang2025soundwave,
  title={Soundwave: Less is More for Speech-Text Alignment in LLMs},
  author={Zhang, Yuhao and Liu, Zhiheng and Bu, Fan and Zhang, Ruiyu and Wang, Benyou and Li, Haizhou},
  journal={arXiv preprint arXiv:2502.12900},
  year={2025}
}
```