Parakeet-TDT-0.6B-v3 β€” LiteRT (TFLite) port

LiteRT (TFLite) port of nvidia/parakeet-tdt-0.6b-v3, packaged for on-device inference (Android / macOS / embedded) with no Python or NeMo runtime dependency.

For model capabilities, languages, training data, license, and benchmarks, see the upstream model card. This card only documents what's specific to the LiteRT port.

What's in this bundle

File                  Size     Purpose
encoder_T1500.tflite  1.15 GB  FP16 encoder, fixed T_mel = 1500 (15 s window)
decoder_step.tflite   23 MB    Single-step LSTM prediction network
joint_step.tflite     12 MB    TDT joint network (token + duration logits)
tokenizer.model       353 KB   SentencePiece BPE tokenizer (vocab = 8192)
manifest.json         β€”        All metadata the runtime needs

Total: ~1.18 GB (FP16). FP32 reference is ~2.37 GB.

Encoder I/O contract

inputs:
  audio_signal : float32 [1, 128, 1500]   # log-mel features (NeMo preproc)
  length       : int32   [1]               # actual mel frames used (≀ 1500)
outputs:
  encoded         : float32 [1, 1024, 188]  # 188 = ceil(1500 / 8)
  encoded_lengths : int32   [1]

Pad shorter inputs with zeros at the tail (the encoder was trained with audio anchored at position 0; left-padding causes hallucinations) and pass the true length.

The 1500-mel bucket covers ≀ 15 s of audio. For long-form input, run the encoder in a sliding-window streaming loop β€” see "Streaming usage" below.

Why int32, not int64. LiteRT's GPU/NPU delegates (LiteRT-CL / OpenCL, NPU accelerators) reject int64 tensors entirely. With an int64 length, every internal CAST node that touches it falls back to CPU, and CompiledModel.create() fails outright on Android with the GPU backend. This bundle is exported with int32 length end-to-end (input β†’ internal mask arange/comparisons β†’ output encoded_lengths). int32 covers > 2 billion mel frames (roughly 6,000 hours of audio at 100 frames/s), so there is no practical range loss.
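A minimal sketch of preparing encoder inputs under this contract (numpy only; `pad_mel` is a name chosen here, not part of the bundle):

```python
import numpy as np

T_MEL = 1500   # encoder bucket size (15 s at 100 mel frames/s)
N_MELS = 128

def pad_mel(mel: np.ndarray):
    """Right-pad a [128, T] log-mel matrix to [1, 128, 1500] and
    return the int32 true length, per the contract above."""
    t = mel.shape[1]
    if t > T_MEL:
        raise ValueError(f"clip too long for the 1500-mel bucket: {t} frames")
    # Zero-pad at the tail only: the encoder expects audio anchored at position 0.
    padded = np.zeros((1, N_MELS, T_MEL), dtype=np.float32)
    padded[0, :, :t] = mel
    length = np.array([t], dtype=np.int32)  # int32, not int64 (delegate-safe)
    return padded, length
```

The returned pair maps directly onto the `audio_signal` / `length` inputs above.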

Why a single bucket and not multi-signature

An earlier revision shipped a multi-signature encoder with 4 buckets (300/500/700/1500) sharing weights inside one .tflite. The disk savings were real (~1.2 GB instead of 4.8 GB for 4 separate files), but on Android the LiteRT CompiledModel.create() API prepares every signature's subgraph at load time β€” each one going through the full delegate-partition pass. With 4 signatures Γ— ~7 s of XNNPACK / GPU partition prep, app cold start was ~28 s.

A single-bucket file is one subgraph: ~7 s init, then ready. If you need multiple bucket sizes for latency reasons, ship them as separate .tflite files (TFLite has no cross-file weight sharing) and load on demand.

Decoder + joint contract

decoder_step:
  inputs:  token int64 [1,1], h float32 [2,1,640], c float32 [2,1,640]
  outputs: g float32 [1,1,640], h float32 [2,1,640], c float32 [2,1,640]

joint_step:
  inputs:  enc_frame float32 [1,1024,1], pred_frame float32 [1,640,1]
  outputs: logits float32 [1,1,1,8198]
           # logits[..., 0:8193] β†’ token logits (8192 BPE + 1 blank)
           # logits[..., 8193:8198] β†’ duration logits over [0,1,2,3,4]

decoder_step.token is int64 because it's an embedding lookup; that op runs on CPU regardless of delegate, so int64 there is harmless.
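Splitting the joint output into its two heads looks like this (a sketch; `split_joint_logits` and the constant names are chosen here, not part of the bundle):

```python
import numpy as np

VOCAB = 8192
BLANK_ID = 8192                # last entry of the token head
DURATIONS = (0, 1, 2, 3, 4)

def split_joint_logits(logits: np.ndarray):
    """Split joint_step output [1, 1, 1, 8198] into (token_id, duration)."""
    flat = logits.reshape(-1)            # 8198 values
    token_logits = flat[: VOCAB + 1]     # 8192 BPE tokens + blank
    dur_logits = flat[VOCAB + 1 :]       # 5 duration bins
    token = int(np.argmax(token_logits))
    dur = DURATIONS[int(np.argmax(dur_logits))]
    return token, dur
```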

Greedy TDT decoding (per encoder frame):

  1. Run joint with current enc_frame and last predicted pred_frame.
  2. token = argmax(token_logits); dur = durations[argmax(duration_logits)] ∈ {0,1,2,3,4}.
  3. If token != blank_id (8192): emit token, advance dur encoder frames, re-prime decoder with the emitted token (h, c update).
  4. Else: advance max(dur, 1) encoder frames; do not advance the decoder.
  5. Repeat until enc_lengths is exhausted.

Cap at ~10 non-blank emissions per encoder frame to guard against the pathological dur=0 decode loop.
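The loop above can be sketched end-to-end with the two step models abstracted as callables (pure-Python sketch; the function names and the blank-primed start token are assumptions mirroring the contracts above, not something the bundle mandates):

```python
import numpy as np

BLANK_ID = 8192
DURATIONS = (0, 1, 2, 3, 4)
MAX_EMITS_PER_FRAME = 10   # guard against the pathological dur=0 loop

def greedy_tdt_decode(encoded, enc_len, decoder_step, joint_step):
    """encoded: float32 [1, 1024, T_enc].
    decoder_step(token_int64, h, c) -> (g, h, c) per the contract above.
    joint_step(enc_frame, pred_frame) -> logits [1, 1, 1, 8198]."""
    h = np.zeros((2, 1, 640), dtype=np.float32)
    c = np.zeros((2, 1, 640), dtype=np.float32)
    # Prime the prediction network with blank as the start token (a common
    # RNNT/TDT convention; assumed here).
    g, h, c = decoder_step(np.array([[BLANK_ID]], dtype=np.int64), h, c)
    tokens, t, emits = [], 0, 0
    while t < enc_len:
        enc_frame = encoded[:, :, t : t + 1]            # [1, 1024, 1]
        pred_frame = g.reshape(1, 640, 1)               # [1, 640, 1]
        logits = joint_step(enc_frame, pred_frame).reshape(-1)
        token = int(np.argmax(logits[: BLANK_ID + 1]))  # 8192 BPE + blank
        dur = DURATIONS[int(np.argmax(logits[BLANK_ID + 1 :]))]
        if token != BLANK_ID and emits < MAX_EMITS_PER_FRAME:
            tokens.append(token)
            g, h, c = decoder_step(np.array([[token]], dtype=np.int64), h, c)
            emits = emits + 1 if dur == 0 else 0
            t += dur
        else:
            t += max(dur, 1)                            # blank always advances
            emits = 0
    return tokens
```

Once the emission cap trips, the else branch forces at least one frame of advance, which breaks any dur=0 cycle.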

Audio preprocessing

LiteRT itself does not produce mel features β€” your runtime must compute them. Match NeMo's preprocessor exactly:

sample_rate    : 16000 Hz (resample if needed)
n_fft          : 512
hop_length     : 160      β†’ 100 mel frames / second
win_length     : 400
n_mels         : 128
preemph        : 0.97
log            : log(mel + 1e-5), per-feature normalized
mel_scale      : slaney

Encoder frame rate after the 8Γ— subsampler: 12.5 fps (1 enc frame = 80 ms).
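Under the config above, the frame counts work out as follows (a sketch; it assumes mel frames = samples // hop_length and ceil-mode 8Γ— subsampling, which is consistent with the 1500 β†’ 188 shapes in the encoder contract):

```python
import math

SAMPLE_RATE = 16000
HOP_LENGTH = 160            # -> 100 mel frames / second

def mel_frames(seconds: float) -> int:
    return int(seconds * SAMPLE_RATE) // HOP_LENGTH

def enc_frames(n_mel: int) -> int:
    return math.ceil(n_mel / 8)   # 8x subsampler, 12.5 enc frames / second
```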

Streaming usage

This bundle supports chunked streaming inference using a left+chunk+right context window that fits inside 15 s. A reference Python implementation is in the upstream repo (transcribe_litert_streaming.py). Recommended config for Android UX:

Knob                   Value   Reason
chunk_seconds          5       committed per step
left_context_seconds   5       encoder bilateral context
right_context_seconds  2       end-to-end latency β‰ˆ 7 s
window total           12 s    (5 + 5 + 2) Γ— 100 = 1200 mel ≀ 1500
carry_state            false   offline-trained model; carrying LSTM state across chunks tends to hurt

We measured ~27 % WER on multilingual long-form audio (EN/ES/IT code-switching) with this config, ~22 % on clean offline ≀15 s English.
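A sketch of the window/commit arithmetic for that config (`stream_windows` is a name chosen here; the authoritative version is transcribe_litert_streaming.py upstream):

```python
FRAMES_PER_SEC = 100   # mel frames per second

def stream_windows(total_mel: int, chunk_s=5, left_s=5, right_s=2):
    """Yield (win_start, win_end, commit_start, commit_end) mel-frame ranges
    for a left+chunk+right sliding window. Only the chunk region is committed;
    left/right frames exist solely to give the encoder context."""
    chunk = chunk_s * FRAMES_PER_SEC
    left = left_s * FRAMES_PER_SEC
    right = right_s * FRAMES_PER_SEC
    pos = 0
    while pos < total_mel:
        commit_end = min(pos + chunk, total_mel)
        win_start = max(0, pos - left)
        win_end = min(commit_end + right, total_mel)
        yield win_start, win_end, pos, commit_end
        pos = commit_end
```

Every window it yields fits inside the 1500-mel bucket, so each can go through the fixed-shape encoder unchanged.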

Quantization

  • All .tflite weights are FP16. Activations remain FP32.
  • Bit-identical token output vs the upstream FP32 model on a 99-clip eval set.

Conversion provenance

Built from upstream nvidia/parakeet-tdt-0.6b-v3.nemo via:

  1. NeMo β†’ torch.export ExportedProgram (per encoder/decoder/joint module).
  2. ExportedProgram β†’ TFLite via litert-torch 0.8.0.
  3. FP32 β†’ FP16 via ai_edge_quantizer FLOAT_CASTING algorithm on FC / Conv / DepthwiseConv / ConvTranspose / EmbeddingLookup ops.

Several NeMo internals required export-time monkey-patches:

  • MaskedConvSequential.{forward,_create_mask} and apply_channel_mask β€” to remove .expand(...) patterns rejected by the TFLite broadcast checker.
  • RelPositionMultiHeadAttentionLongformer._get_invalid_locations_mask β€” to build masks in bool instead of uint8 (litert-torch has no uint8 lowering).
  • ConformerEncoder.{forward_internal,_create_masks} and MaskedConvSequential.{forward,_create_mask} β€” to keep the entire length pipeline in int32 instead of NeMo's default int64, so LiteRT's GPU/NPU delegates can compile the graph without falling back to CPU.

Limitations

  1. Audio at position 0. The encoder expects audio anchored at the start of its input window. Padding before the audio causes hallucinations.
  2. 15 s max per call. Use the streaming chunker for longer clips.
  3. No VAD or diarization. Pair with an external VAD or a diarizer (e.g. Sortformer) for speaker-attributed transcripts.
  4. Multilingual but no language token. Code-switching works, but the model doesn't emit a language ID. Run a separate classifier if you need it.

License

Inherits the upstream nvidia/parakeet-tdt-0.6b-v3 license (CC-BY-4.0).

Citation

@misc{nvidia_parakeet_tdt_0_6b_v3,
  title  = {Parakeet-TDT-0.6B-v3},
  author = {NVIDIA},
  year   = {2025},
  url    = {https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3},
}