# Parakeet-TDT-0.6B-v3: LiteRT (TFLite) port

LiteRT (TFLite) port of
[nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3),
packaged for on-device inference (Android / macOS / embedded) without a Python
or NeMo runtime dependency.
For model capabilities, languages, training data, license, and benchmarks, see the upstream model card. This card only documents what's specific to the LiteRT port.
## What's in this bundle

| File | Size | Purpose |
|---|---|---|
| `encoder_T1500.tflite` | 1.15 GB | FP16 encoder, fixed T_mel = 1500 (15 s window) |
| `decoder_step.tflite` | 23 MB | Single-step LSTM prediction network |
| `joint_step.tflite` | 12 MB | TDT joint network (token + duration logits) |
| `tokenizer.model` | 353 KB | SentencePiece BPE tokenizer (vocab = 8192) |
| `manifest.json` | n/a | All metadata the runtime needs |
Total: ~1.18 GB (FP16). FP32 reference is ~2.37 GB.
## Encoder I/O contract

```
inputs:
  audio_signal : float32 [1, 128, 1500]   # log-mel features (NeMo preproc)
  length       : int32   [1]              # actual mel frames used (≤ 1500)
outputs:
  encoded         : float32 [1, 1024, 188]  # 188 = ceil(1500 / 8), 8x subsampling
  encoded_lengths : int32   [1]
```
Pad shorter inputs with zeros at the tail (the encoder was trained with audio anchored at position 0; left-padding causes hallucinations) and pass the true length.
The 1500-mel bucket covers ≤ 15 s of audio. For long-form input, run the encoder in a sliding-window streaming loop (see "Streaming usage" below).
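Tail-padding plus the true length can be sketched as follows. This is a minimal numpy sketch; `pad_right` is a hypothetical helper, not part of the bundle:

```python
import numpy as np

def pad_right(mel: np.ndarray, t_max: int = 1500):
    """Tail-pad a log-mel batch [1, 128, T] (T <= t_max) to the fixed bucket.

    Zeros go AFTER the audio: the encoder was trained with audio anchored
    at position 0, so left-padding would cause hallucinations.
    Returns the padded features and the int32 `length` tensor.
    """
    t = mel.shape[2]
    assert t <= t_max, "clip longer than the 15 s bucket; use the streaming loop"
    padded = np.zeros((mel.shape[0], mel.shape[1], t_max), dtype=np.float32)
    padded[:, :, :t] = mel
    # length must be int32, matching the encoder's I/O contract
    return padded, np.array([t], dtype=np.int32)
```

Feed `padded` as `audio_signal` and the returned array as `length`.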
**Why int32, not int64.** LiteRT's GPU/NPU delegates (LiteRT-CL / OpenCL,
NPU accelerator) reject int64 tensors entirely. With an int64 length, every
internal CAST node touching it falls back to CPU, and CompiledModel.create()
fails outright on Android with the GPU backend. This bundle is exported with
int32 length end-to-end (input → internal mask arange/comparisons → output
encoded_lengths). int32 covers > 2 billion mel frames (roughly 6,000 hours of
audio at 100 frames/s), so there is no practical range loss.
## Why a single bucket and not multi-signature
An earlier revision shipped a multi-signature encoder with 4 buckets
(300/500/700/1500) sharing weights inside one .tflite. The disk savings
were real (~1.2 GB instead of 4.8 GB for 4 separate files), but on Android
the LiteRT CompiledModel.create() API prepares every signature's
subgraph at load time, each one going through the full delegate-partition
pass. With 4 signatures × ~7 s of XNNPACK / GPU partition prep, app cold
start was ~28 s.
A single-bucket file is one subgraph: ~7 s init, then ready. If you need
multiple bucket sizes for latency reasons, ship them as separate .tflite
files (TFLite has no cross-file weight sharing) and load on demand.
## Decoder + joint contract

```
decoder_step:
  inputs:  token int64 [1,1], h float32 [2,1,640], c float32 [2,1,640]
  outputs: g float32 [1,1,640], h float32 [2,1,640], c float32 [2,1,640]

joint_step:
  inputs:  enc_frame float32 [1,1024,1], pred_frame float32 [1,640,1]
  outputs: logits float32 [1,1,1,8198]
    # logits[..., 0:8193]    -> token logits (8192 BPE + 1 blank)
    # logits[..., 8193:8198] -> duration logits over [0,1,2,3,4]
```
`decoder_step.token` is int64 because it's an embedding lookup; that op
runs on CPU regardless of delegate, so int64 there is harmless.
Greedy TDT decoding (per encoder frame):

- Run joint with the current `enc_frame` and the last predicted `pred_frame`.
  `token = argmax(token_logits)`; `dur = durations[argmax(duration_logits)] ∈ {0,1,2,3,4}`.
- If `token != blank_id` (8192): emit the token, advance `dur` encoder frames,
  and re-prime the decoder with the emitted token (h, c update).
- Else: advance `max(dur, 1)` encoder frames; do not advance the decoder.
- Repeat until `enc_lengths` is exhausted.

Cap at ~10 non-blank emissions per encoder frame to guard against the
pathological dur=0 decode loop.
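The loop above can be sketched in plain numpy. `run_decoder(token, h, c)` and `run_joint(enc_frame, g)` are hypothetical wrappers around the two `.tflite` models (including the `[1,1,640]` to `[1,640,1]` reshape of `pred_frame`); priming with blank as the start token is an assumption matching NeMo's usual SOS-equals-blank convention:

```python
import numpy as np

def greedy_tdt_decode(enc, enc_len, run_decoder, run_joint,
                      blank_id=8192, durations=(0, 1, 2, 3, 4),
                      max_symbols=10):
    """Greedy TDT decode over encoder output `enc` [1, 1024, T]."""
    h = np.zeros((2, 1, 640), np.float32)
    c = np.zeros((2, 1, 640), np.float32)
    # Prime the prediction network with blank as the start token (assumption).
    g, h, c = run_decoder(blank_id, h, c)
    tokens, t, emitted = [], 0, 0
    while t < enc_len:
        flat = run_joint(enc[:, :, t:t + 1], g).reshape(-1)  # [8198]
        token = int(np.argmax(flat[:blank_id + 1]))          # 8192 BPE + blank
        dur = durations[int(np.argmax(flat[blank_id + 1:]))]
        if token != blank_id and emitted < max_symbols:
            tokens.append(token)
            g, h, c = run_decoder(token, h, c)    # re-prime with emitted token
            t += dur
            emitted = emitted + 1 if dur == 0 else 0  # cap the dur=0 loop
        else:
            t += max(dur, 1)                      # blank: decoder state frozen
            emitted = 0
    return tokens
```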
## Audio preprocessing

LiteRT itself does not produce mel features; your runtime must compute them. Match NeMo's preprocessor exactly:

```
sample_rate : 16000 Hz (resample if needed)
n_fft       : 512
hop_length  : 160      # -> 100 mel frames / second
win_length  : 400
n_mels      : 128
preemph     : 0.97
log         : log(mel + 1e-5), per-feature normalized
mel_scale   : slaney
```
Encoder frame rate after the 8× subsampler: 12.5 fps (1 enc frame = 80 ms).
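Assuming NeMo's usual defaults (Hann window, power-2 spectrum, slaney filterbank as in librosa), the pipeline can be sketched in plain numpy. This is illustrative, not a drop-in replacement: it skips STFT centering and dithering, so verify frame-for-frame against NeMo's preprocessor before shipping:

```python
import numpy as np

def slaney_mel_fb(sr=16000, n_fft=512, n_mels=128):
    """Slaney mel filterbank: linear below 1 kHz, logarithmic above."""
    f_sp, logstep = 200.0 / 3.0, np.log(6.4) / 27.0
    def hz_to_mel(f):
        f = np.asarray(f, dtype=np.float64)
        return np.where(f >= 1000.0,
                        15.0 + np.log(np.maximum(f, 1e-10) / 1000.0) / logstep,
                        f / f_sp)
    def mel_to_hz(m):
        m = np.asarray(m, dtype=np.float64)
        return np.where(m >= 15.0, 1000.0 * np.exp((m - 15.0) * logstep), m * f_sp)
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1), dtype=np.float32)
    for i in range(n_mels):
        lo, mid, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        tri = np.minimum((fft_freqs - lo) / (mid - lo), (hi - fft_freqs) / (hi - mid))
        fb[i] = np.maximum(0.0, tri) * (2.0 / (hi - lo))  # Slaney area norm
    return fb

def log_mel(audio, hop=160, win=400, n_fft=512, preemph=0.97):
    """audio: float32 [N] at 16 kHz -> log-mel [1, 128, T] (no STFT centering)."""
    audio = np.concatenate([audio[:1], audio[1:] - preemph * audio[:-1]])
    n_frames = 1 + (len(audio) - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = audio[idx] * np.hanning(win)
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2   # [T, 257]
    mel = power @ slaney_mel_fb().T                             # [T, 128]
    feats = np.log(mel + 1e-5)
    feats = (feats - feats.mean(0)) / (feats.std(0) + 1e-5)     # per-feature norm
    return feats.T[None].astype(np.float32)                     # [1, 128, T]
```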
## Streaming usage
This bundle supports chunked streaming inference using a left+chunk+right
context window that fits inside 15 s. A reference Python implementation is
in the upstream repo (transcribe_litert_streaming.py). Recommended config
for Android UX:
| Knob | Value | Reason |
|---|---|---|
| `chunk_seconds` | 5 | committed per step |
| `left_context_seconds` | 5 | encoder bilateral context |
| `right_context_seconds` | 2 | end-to-end latency ≈ 7 s |
| window total | 12 s | (5 + 5 + 2) × 100 = 1200 mel ≤ 1500 |
| `carry_state` | false | offline-trained model; carrying LSTM state across chunks tends to hurt |
We measured ~27 % WER on multilingual long-form audio (EN/ES/IT code-switching) with this config, and ~22 % on clean offline ≤ 15 s English.
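The window arithmetic behind this table can be sketched with a hypothetical helper (seconds in, seconds out; the actual chunker lives in the upstream `transcribe_litert_streaming.py`):

```python
def stream_windows(total_s, chunk_s=5.0, left_s=5.0, right_s=2.0):
    """Yield (win_start, win_end, commit_start, commit_end) in seconds.

    Each step transcribes [win_start, win_end) but commits only tokens whose
    frames fall in [commit_start, commit_end); the left/right margins exist
    purely to give the encoder bilateral context. The max window is
    left + chunk + right = 12 s, which stays inside the 15 s encoder bucket.
    """
    t = 0.0
    while t < total_s:
        commit_end = min(t + chunk_s, total_s)
        yield (max(0.0, t - left_s), min(total_s, commit_end + right_s),
               t, commit_end)
        t = commit_end
```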
## Quantization

- All `.tflite` weights are FP16. Activations remain FP32.
- Bit-identical token output vs the upstream FP32 model on a 99-clip eval set.
## Conversion provenance

Built from the upstream `nvidia/parakeet-tdt-0.6b-v3.nemo` checkpoint via:

1. NeMo → torch.export `ExportedProgram` (per encoder/decoder/joint module).
2. `ExportedProgram` → TFLite via `litert-torch` 0.8.0.
3. FP32 → FP16 via the `ai_edge_quantizer` `FLOAT_CASTING` algorithm on FC / Conv / DepthwiseConv / ConvTranspose / EmbeddingLookup ops.

Several NeMo internals required export-time monkey-patches:

- `MaskedConvSequential.{forward,_create_mask}` and `apply_channel_mask`: patched to remove `.expand(...)` patterns rejected by the TFLite broadcast checker.
- `RelPositionMultiHeadAttentionLongformer._get_invalid_locations_mask`: patched to build masks in `bool` instead of `uint8` (litert-torch has no uint8 lowering).
- `ConformerEncoder.{forward_internal,_create_masks}` and `MaskedConvSequential.{forward,_create_mask}`: patched to keep the entire length pipeline in `int32` instead of NeMo's default `int64`, so LiteRT's GPU/NPU delegates can compile the graph without falling back to CPU.
## Limitations

- **Audio at position 0.** The encoder expects audio anchored at the start of its input window. Padding before the audio causes hallucinations.
- **15 s max per call.** Use the streaming chunker for longer clips.
- **No VAD or diarization.** Pair with an external VAD or a diarizer (e.g. Sortformer) for speaker-attributed transcripts.
- **Multilingual but no language token.** Code-switching works, but the model doesn't emit a language ID. Run a separate classifier if you need it.
## License
Inherits the upstream nvidia/parakeet-tdt-0.6b-v3 license (CC-BY-4.0).
## Citation

```bibtex
@misc{nvidia_parakeet_tdt_0_6b_v3,
  title  = {Parakeet-TDT-0.6B-v3},
  author = {NVIDIA},
  year   = {2025},
  url    = {https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3},
}
```