Revolab VITS — Multi-Speaker Bahasa Melayu TTS

Piper TTS voice models trained on Revolab Malay speech datasets using the VITS architecture.

Speakers (56 trained)

Production Quality (CER < 10%)

Speaker ID	Name	Samples	CER	WER
sarah	sarah	27,792	0.0402	0.1856
pendakwah	pendakwah	46,222	0.0619	0.1847
pendakwah_teknologi	Pendakwah Teknologi	46,222	0.0619	0.1847
paan	Paan	27,434	0.0630	0.1759
anwar	AI	100	0.0732	0.2101

Usable Quality (CER 50-65%)

Speaker ID	Name	Samples	CER	WER
G0095_0001	G0095	12,317	0.5382	0.7685
1614-NorinaYahya	Noriya	27,962	0.5490	0.8351
8-enhanced-v2	Sarah	27,991	0.6084	0.9322
Iqbal_25	Iqbal	11,077	0.6107	0.8428
JackLim_1	Jacky	9,143	0.6390	0.9050

Low Quality (CER > 75%)

Speaker ID	Name	Samples	CER	WER
Angry_2	Pearl (Angry 2)	27,999	0.7556	0.9965
cv_971b3c0c6dbd	cv_971b3c0c6dbd	6,632	0.7557	0.9845
Surprise	Surprise	5,986	0.7604	0.9998
G0068_0003	G0068	12,657	0.7670	0.9976
G0004_0012	G0004	12,656	0.7756	1.0228
G0016_0001	G0016	12,694	0.7760	1.0000
cv_06a7ed020ffa	cv_06a7ed020ffa	7,050	0.7771	0.9975
Dania_22	Dania	13,558	0.7777	0.9869
2394-Digi-Mohon-Share-Amel-BM	Amel	29,273	0.7784	1.0086
cv_0c744927a83d	cv_0c744927a83d	7,461	0.7786	0.9988
cv_e6ca744ffbb2	cv_e6ca744ffbb2	6,604	0.7813	0.9869
cv_1a952f7ad4bd	cv_1a952f7ad4bd	6,587	0.7829	1.0155
berani_buat	berani_buat	8,657	0.7853	0.9967
Happy	Pearl (Happy)	12,317	0.7854	1.0000
Sad	Pearl (Sad)	7,202	0.7855	0.9907
k-lao-ZH	k lao ZH	5,533	0.7875	1.0042
cv_a2b722b2d474	cv_a2b722b2d474	5,805	0.7887	0.9907
Sabil_2	Kamal	8,276	0.7918	1.0093
cv_51062240a78a	cv_51062240a78a	6,929	0.7944	1.0170
mych_male_yt_13887	mych male yt 13887	5,967	0.7954	1.0098
sugumaran_6	sugumaran 6	5,630	0.7969	1.0052
Alvin_6	Alvin	27,996	0.7989	1.0199
cv_9d9664719e9d	cv_9d9664719e9d	7,020	0.8016	1.0319
Angry_1	Pearl (Angry 1)	27,849	0.8026	0.9905
cv_bd96fb646e95	cv_bd96fb646e95	6,552	0.8061	0.9896
cv_c0b0ce94caa2	cv_c0b0ce94caa2	7,053	0.8104	1.0093
niketh_23	niketh 23	5,767	0.8177	1.0074
angellyn	angellyn	6,096	0.8185	1.0000
hendrick-22	Hendrick	6,774	0.8185	0.9880
Farid_596	Faiz	5,216	0.8574	0.9966

Non-Malay Speakers (no eval)

Speaker ID	Name	Samples	Language
Akari_Kitou	Akari Kitou	6,772	Japanese
Aoi_Yuuki	Aoi Yuuki	6,772	Japanese
Ayana_Taketatsu	Ayana Taketatsu	10,970	Japanese
Celine-ZH	Celine Zh	10,967	Chinese
ChangYong	Changyong	16,195	Chinese
Eri_Kitamura	Eri Kitamura	5,209	Japanese
Haruka_Tomatsu	Haruka Tomatsu	6,758	Japanese
Kana_Hanazawa	Kana Hanazawa	8,778	Japanese
Maaya_Uchida	Maaya Uchida	8,368	Japanese
Nao_Touyama	Nao Touyama	8,400	Japanese
Rie_Kugimiya	Rie Kugimiya	15,001	Japanese
Rie_Takahashi	Rie Takahashi	13,534	Japanese
Rina_Satou	Rina Satou	12,969	Japanese
Sora_Amamiya	Sora Amamiya	6,901	Japanese

Evaluation Methodology

Evaluated using WhisperX large-v3 transcription + revo-norm text normalization on 36 sample texts covering:

Digits/currency (RM10.50, 03-8888 9999)
Malaysian locations (Jalan Ampang, Pulau Pinang)
Names — Malay/Chinese/Indian (Encik Ahmad, Lim Wei Jie, Rajesh Kumar)
English code-switching (Meeting kita pukul 3 petang)
Dates/time (30hb Jun 2024, pukul 8 pagi)
Addresses (25, Jalan Taman Indah 3/4, 43000 Kajang)
Financial terms (Baki akaun RM5,670.23)
General conversation

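CER = Character Error Rate. WER = Word Error Rate. Lower is better for both. The tables report both as fractions (0.0402 = 4.02%), while the tier headings use percentages. WER can exceed 1.0 because inserted words also count as errors.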
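As a rough illustration of how these two metrics are usually computed, the sketch below uses a plain Levenshtein edit distance over characters (CER) and over whitespace-separated words (WER). It is not the evaluation script behind the tables. The function names are hypothetical, and counting spaces as characters is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the hypothesis
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# One substituted character out of five.
one_char_off = cer("lapan", "lapam")

# Three inserted words against a two-word reference push WER above 1.0.
over_one = wer("pukul lapan", "pukul lapan pagi ini tadi")
```

The second example shows why some speakers in the tables have a WER above 1.0: the transcription contains more wrong or extra words than the reference has words.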

Full per-speaker results: eval_all/<speaker>_eval.csv

Install

pip install piper-tts
pip install git+https://github.com/khursanirevo/revo-norm.git

Quick Start

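from piper import PiperVoice
from revo_norm import normalize_text

# Normalize first so digits and currency are spelled out before phonemization.
raw = "Harga RM10.50 sahaja"
text = normalize_text(raw, language="ms")

# "hf_repo" is assumed to be the local directory holding the downloaded repository files.
voice = PiperVoice.load("hf_repo/speakers/sarah/model.onnx",
                         config_path="hf_repo/speakers/sarah/model.onnx.json")
audio = voice.synthesize(text)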

Text Normalization

revo-norm handles Malay-specific normalization:

Numbers → words ("115" → "seratus lima belas")
Currency ("RM10.50" → "sepuluh ringgit lima puluh sen")
Dates, phone numbers, percentages, temperatures, etc.
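To make the first two rules concrete, here is a minimal sketch that spells out integers from 0 to 999 in Malay and expands a simple `RM<ringgit>.<sen>` amount. It is an illustration only, not revo-norm's implementation. The names `malay_number` and `malay_currency` are hypothetical, and thousands, separators, and other formats are deliberately left out.

```python
import re

UNITS = ["kosong", "satu", "dua", "tiga", "empat", "lima",
         "enam", "tujuh", "lapan", "sembilan"]


def malay_number(n):
    """Spell out an integer in the range 0..999 in Malay."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only covers 0..999")
    if n < 10:
        return UNITS[n]
    if n == 10:
        return "sepuluh"
    if n == 11:
        return "sebelas"
    if n < 20:
        return UNITS[n - 10] + " belas"
    if n < 100:
        tens, rest = divmod(n, 10)
        words = UNITS[tens] + " puluh"
        return words if rest == 0 else words + " " + UNITS[rest]
    hundreds, rest = divmod(n, 100)
    words = "seratus" if hundreds == 1 else UNITS[hundreds] + " ratus"
    return words if rest == 0 else words + " " + malay_number(rest)


def malay_currency(text):
    """Expand an amount such as 'RM10.50' into ringgit and sen words."""
    match = re.fullmatch(r"RM(\d{1,3})\.(\d{2})", text)
    if match is None:
        raise ValueError("expected a form like RM10.50")
    ringgit, sen = int(match.group(1)), int(match.group(2))
    words = malay_number(ringgit) + " ringgit"
    if sen:
        words += " " + malay_number(sen) + " sen"
    return words


number_words = malay_number(115)
currency_words = malay_currency("RM10.50")
```

Both results match the examples listed above: "seratus lima belas" and "sepuluh ringgit lima puluh sen".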

Structure

speakers.json              # Speaker registry with eval metrics
speakers/
  <name>/
    model.onnx             # ONNX export for inference
    model.onnx.json        # Phoneme config
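Given this layout, the two files for a speaker can be located from the repository root and a speaker directory name. The helper below is a small sketch. `speaker_files` is a hypothetical name, and it assumes the directory under `speakers/` matches the Speaker ID column, as it does for `sarah` in the Quick Start.

```python
from pathlib import Path


def speaker_files(repo_root, speaker_id):
    """Return (model_path, config_path) for one speaker directory."""
    speaker_dir = Path(repo_root) / "speakers" / speaker_id
    return speaker_dir / "model.onnx", speaker_dir / "model.onnx.json"


# "hf_repo" stands for the local copy of the repository, as in the Quick Start.
model_path, config_path = speaker_files("hf_repo", "sarah")
```

The resulting paths can be passed to `PiperVoice.load` as the model path and `config_path`.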

Performance (CPU)

Metric	Value
Avg latency	~54 ms
Avg RTF	0.030
Speed	33.6x realtime
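The three figures are related by the usual definition of real-time factor (RTF): synthesis time divided by the duration of the audio produced, with speed as its inverse. The sketch below works through that arithmetic. It assumes "latency" means total synthesis time per utterance, which the card does not state, and the function names are hypothetical.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF: time spent synthesizing divided by the audio duration produced."""
    return synthesis_seconds / audio_seconds


def speedup(rtf):
    """How many times faster than real time a given RTF is."""
    return 1.0 / rtf


reported_rtf = 0.030
reported_latency_s = 0.054

# About 33.3x; the reported 33.6x probably reflects an unrounded RTF
# or a per-utterance average (an assumption, not stated in the card).
speed_from_rtf = speedup(reported_rtf)

# Average audio length implied by the reported figures: about 1.8 seconds.
implied_audio_s = reported_latency_s / reported_rtf
```

Under that assumption, the benchmark utterances average a little under two seconds of audio each.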

Training

All models trained with:

Architecture: VITS
Phonemizer: espeak-ng (ms voice)
Sample rate: 22050 Hz
GPU: NVIDIA H200 NVL
