Parakeet-TDT-0.6B-v3 β€” LiteRT (TFLite) port

LiteRT (TFLite) port of nvidia/parakeet-tdt-0.6b-v3, packaged for on-device inference (Android / macOS / embedded) with no Python or NeMo runtime dependency.

For model capabilities, languages, training data, license, and benchmarks, see the upstream model card. This card only documents what's specific to the LiteRT port.

What's in this bundle

File                  Size     Purpose
encoder_T1500.tflite  1.15 GB  FP16 encoder, fixed T_mel = 1500 (15 s window)
decoder_step.tflite   23 MB    Single-step LSTM prediction network
joint_step.tflite     12 MB    TDT joint network (token + duration logits)
tokenizer.model       353 KB   SentencePiece BPE tokenizer (vocab = 8192)
manifest.json         β€”        All metadata the runtime needs

Total: ~1.18 GB (FP16). FP32 reference is ~2.37 GB.

Encoder I/O contract

inputs:
  audio_signal : float32 [1, 128, 1500]   # log-mel features (NeMo preproc)
  length       : int32   [1]               # actual mel frames used (≀ 1500)
outputs:
  encoded         : float32 [1, 1024, 188]  # 188 = ceil(1500 / 8)
  encoded_lengths : int32   [1]

Pad shorter inputs with zeros at the tail (the encoder was trained with audio anchored at position 0; left-padding causes hallucinations) and pass the true length.

The 1500-mel bucket covers ≀ 15 s of audio. For long-form input, run the encoder in a sliding-window streaming loop β€” see "Streaming usage" below.

Why int32, not int64. LiteRT's GPU/NPU delegates (LiteRT-CL / OpenCL, NPU accelerators) reject int64 tensors entirely. With an int64 length, every internal CAST node that touches it falls back to CPU, and CompiledModel.create() fails outright on Android with the GPU backend. This bundle is exported with int32 length end-to-end (input β†’ internal mask arange/comparisons β†’ output encoded_lengths). int32 covers > 2 billion mel frames (roughly 6,000 hours of audio at 100 frames/s), so there is no practical range loss.
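A minimal sketch of preparing encoder inputs under this contract (numpy only; `pad_mel` is a name chosen here, not part of the bundle):

```python
import numpy as np

T_MEL = 1500   # encoder bucket size (15 s at 100 mel frames/s)
N_MELS = 128

def pad_mel(mel: np.ndarray):
    """Right-pad a [128, T] log-mel matrix to [1, 128, 1500] and
    return the int32 true length, per the contract above."""
    t = mel.shape[1]
    if t > T_MEL:
        raise ValueError(f"clip too long for the 1500-mel bucket: {t} frames")
    # Zero-pad at the tail only: the encoder expects audio anchored at position 0.
    padded = np.zeros((1, N_MELS, T_MEL), dtype=np.float32)
    padded[0, :, :t] = mel
    length = np.array([t], dtype=np.int32)  # int32, not int64 (delegate-safe)
    return padded, length
```

The returned pair maps directly onto the `audio_signal` / `length` inputs above.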

Why a single bucket and not multi-signature

An earlier revision shipped a multi-signature encoder with 4 buckets (300/500/700/1500) sharing weights inside one .tflite. The disk savings were real (~1.2 GB instead of 4.8 GB for 4 separate files), but on Android the LiteRT CompiledModel.create() API prepares every signature's subgraph at load time β€” each one going through the full delegate-partition pass. With 4 signatures Γ— ~7 s of XNNPACK / GPU partition prep, app cold start was ~28 s.

A single-bucket file is one subgraph: ~7 s init, then ready. If you need multiple bucket sizes for latency reasons, ship them as separate .tflite files (TFLite has no cross-file weight sharing) and load on demand.

Decoder + joint contract

decoder_step:
  inputs:  token int64 [1,1], h float32 [2,1,640], c float32 [2,1,640]
  outputs: g float32 [1,1,640], h float32 [2,1,640], c float32 [2,1,640]

joint_step:
  inputs:  enc_frame float32 [1,1024,1], pred_frame float32 [1,640,1]
  outputs: logits float32 [1,1,1,8198]
           # logits[..., 0:8193] β†’ token logits (8192 BPE + 1 blank)
           # logits[..., 8193:8198] β†’ duration logits over [0,1,2,3,4]

decoder_step.token is int64 because it's an embedding lookup; that op runs on CPU regardless of delegate, so int64 there is harmless.
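Splitting the joint output into its two heads looks like this (a sketch; `split_joint_logits` and the constant names are chosen here, not part of the bundle):

```python
import numpy as np

VOCAB = 8192
BLANK_ID = 8192                # last entry of the token head
DURATIONS = (0, 1, 2, 3, 4)

def split_joint_logits(logits: np.ndarray):
    """Split joint_step output [1, 1, 1, 8198] into (token_id, duration)."""
    flat = logits.reshape(-1)            # 8198 values
    token_logits = flat[: VOCAB + 1]     # 8192 BPE tokens + blank
    dur_logits = flat[VOCAB + 1 :]       # 5 duration bins
    token = int(np.argmax(token_logits))
    dur = DURATIONS[int(np.argmax(dur_logits))]
    return token, dur
```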

Greedy TDT decoding (per encoder frame):

  1. Run joint with current enc_frame and last predicted pred_frame.
  2. token = argmax(token_logits); dur = durations[argmax(duration_logits)] ∈ {0,1,2,3,4}.
  3. If token != blank_id (8192): emit token, advance dur encoder frames, re-prime decoder with the emitted token (h, c update).
  4. Else: advance max(dur, 1) encoder frames; do not advance the decoder.
  5. Repeat until enc_lengths is exhausted.

Cap at ~10 non-blank emissions per encoder frame to guard against the pathological dur=0 decode loop.
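The loop above can be sketched end-to-end with the two step models abstracted as callables (pure-Python sketch; the function names and the blank-primed start token are assumptions mirroring the contracts above, not something the bundle mandates):

```python
import numpy as np

BLANK_ID = 8192
DURATIONS = (0, 1, 2, 3, 4)
MAX_EMITS_PER_FRAME = 10   # guard against the pathological dur=0 loop

def greedy_tdt_decode(encoded, enc_len, decoder_step, joint_step):
    """encoded: float32 [1, 1024, T_enc].
    decoder_step(token_int64, h, c) -> (g, h, c) per the contract above.
    joint_step(enc_frame, pred_frame) -> logits [1, 1, 1, 8198]."""
    h = np.zeros((2, 1, 640), dtype=np.float32)
    c = np.zeros((2, 1, 640), dtype=np.float32)
    # Prime the prediction network with blank as the start token (a common
    # RNNT/TDT convention; assumed here).
    g, h, c = decoder_step(np.array([[BLANK_ID]], dtype=np.int64), h, c)
    tokens, t, emits = [], 0, 0
    while t < enc_len:
        enc_frame = encoded[:, :, t : t + 1]            # [1, 1024, 1]
        pred_frame = g.reshape(1, 640, 1)               # [1, 640, 1]
        logits = joint_step(enc_frame, pred_frame).reshape(-1)
        token = int(np.argmax(logits[: BLANK_ID + 1]))  # 8192 BPE + blank
        dur = DURATIONS[int(np.argmax(logits[BLANK_ID + 1 :]))]
        if token != BLANK_ID and emits < MAX_EMITS_PER_FRAME:
            tokens.append(token)
            g, h, c = decoder_step(np.array([[token]], dtype=np.int64), h, c)
            emits = emits + 1 if dur == 0 else 0
            t += dur
        else:
            t += max(dur, 1)                            # blank always advances
            emits = 0
    return tokens
```

Once the emission cap trips, the else branch forces at least one frame of advance, which breaks any dur=0 cycle.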

Audio preprocessing

LiteRT itself does not produce mel features β€” your runtime must compute them. Match NeMo's preprocessor exactly:

sample_rate    : 16000 Hz (resample if needed)
n_fft          : 512
hop_length     : 160      β†’ 100 mel frames / second
win_length     : 400
n_mels         : 128
preemph        : 0.97
log            : log(mel + 1e-5), per-feature normalized
mel_scale      : slaney

Encoder frame rate after the 8Γ— subsampler: 12.5 fps (1 enc frame = 80 ms).
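Under the config above, the frame counts work out as follows (a sketch; it assumes mel frames = samples // hop_length and ceil-mode 8Γ— subsampling, which is consistent with the 1500 β†’ 188 shapes in the encoder contract):

```python
import math

SAMPLE_RATE = 16000
HOP_LENGTH = 160            # -> 100 mel frames / second

def mel_frames(seconds: float) -> int:
    return int(seconds * SAMPLE_RATE) // HOP_LENGTH

def enc_frames(n_mel: int) -> int:
    return math.ceil(n_mel / 8)   # 8x subsampler, 12.5 enc frames / second
```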

Streaming usage

This bundle supports chunked streaming inference using a left+chunk+right context window that fits inside 15 s. A reference Python implementation is in the upstream repo (transcribe_litert_streaming.py). Recommended config for Android UX:

Knob                   Value   Reason
chunk_seconds          5       committed per step
left_context_seconds   5       encoder bilateral context
right_context_seconds  2       end-to-end latency β‰ˆ 7 s
window total           12 s    (5 + 5 + 2) Γ— 100 = 1200 mel ≀ 1500
carry_state            false   offline-trained model; carrying LSTM state across chunks tends to hurt

We measured ~27 % WER on multilingual long-form audio (EN/ES/IT code-switching) with this config, ~22 % on clean offline ≀15 s English.
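A sketch of the window/commit arithmetic for that config (`stream_windows` is a name chosen here; the authoritative version is transcribe_litert_streaming.py upstream):

```python
FRAMES_PER_SEC = 100   # mel frames per second

def stream_windows(total_mel: int, chunk_s=5, left_s=5, right_s=2):
    """Yield (win_start, win_end, commit_start, commit_end) mel-frame ranges
    for a left+chunk+right sliding window. Only the chunk region is committed;
    left/right frames exist solely to give the encoder context."""
    chunk = chunk_s * FRAMES_PER_SEC
    left = left_s * FRAMES_PER_SEC
    right = right_s * FRAMES_PER_SEC
    pos = 0
    while pos < total_mel:
        commit_end = min(pos + chunk, total_mel)
        win_start = max(0, pos - left)
        win_end = min(commit_end + right, total_mel)
        yield win_start, win_end, pos, commit_end
        pos = commit_end
```

Every window it yields fits inside the 1500-mel bucket, so each can go through the fixed-shape encoder unchanged.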

Quantization

  • All .tflite weights are FP16. Activations remain FP32.
  • Bit-identical token output vs the upstream FP32 model on a 99-clip eval set.

Conversion provenance

Built from upstream nvidia/parakeet-tdt-0.6b-v3.nemo via:

  1. NeMo β†’ torch.export ExportedProgram (per encoder/decoder/joint module).
  2. ExportedProgram β†’ TFLite via litert-torch 0.8.0.
  3. FP32 β†’ FP16 via ai_edge_quantizer FLOAT_CASTING algorithm on FC / Conv / DepthwiseConv / ConvTranspose / EmbeddingLookup ops.

Several NeMo internals required export-time monkey-patches:

  • MaskedConvSequential.{forward,_create_mask} and apply_channel_mask β€” to remove .expand(...) patterns rejected by the TFLite broadcast checker.
  • RelPositionMultiHeadAttentionLongformer._get_invalid_locations_mask β€” to build masks in bool instead of uint8 (litert-torch has no uint8 lowering).
  • ConformerEncoder.{forward_internal,_create_masks} and MaskedConvSequential.{forward,_create_mask} β€” to keep the entire length pipeline in int32 instead of NeMo's default int64, so LiteRT's GPU/NPU delegates can compile the graph without falling back to CPU.

Limitations

  1. Audio at position 0. The encoder expects audio anchored at the start of its input window. Padding before the audio causes hallucinations.
  2. 15 s max per call. Use the streaming chunker for longer clips.
  3. No VAD or diarization. Pair with an external VAD or a diarizer (e.g. Sortformer) for speaker-attributed transcripts.
  4. Multilingual but no language token. Code-switching works, but the model doesn't emit a language ID. Run a separate classifier if you need it.

License

Inherits the upstream nvidia/parakeet-tdt-0.6b-v3 license (CC-BY-4.0).

Citation

@misc{nvidia_parakeet_tdt_0_6b_v3,
  title  = {Parakeet-TDT-0.6B-v3},
  author = {NVIDIA},
  year   = {2025},
  url    = {https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3},
}