# Parakeet TDT 0.6B v3 (MLX, Encoder INT4)

NVIDIA Parakeet TDT v3 with encoder-only INT4 quantization: the lite variant for memory-constrained devices (8 GB Macs). 61% smaller than BF16 with only +0.4pp WER on real-world speech.
See also:

- `sonic-speech/parakeet-tdt-0.6b-v3`: full BF16 reference
- `sonic-speech/parakeet-tdt-0.6b-v3-int8`: encoder INT8 (recommended)
## Performance
| Metric | BF16 | INT4 | Change |
|---|---|---|---|
| WER (LibriSpeech) | 0.82% | 0.82% | None |
| WER (TED-LIUM) | 15.1% | 15.5% | +0.4pp |
| RTFx | 73x | 98x | +34% |
| Peak Memory | 3,002 MB | 1,003 MB | -67% |
| Weight Size | 1,254 MB | 489 MB | -61% |
Benchmarked on Apple M3 Max (64 GB), macOS Sequoia 15.7.3, MLX 0.30.4.
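RTFx is the real-time factor: seconds of audio processed per second of wall-clock time, so 98x means an hour of audio transcribes in roughly 37 seconds. A small arithmetic sketch using the table's numbers (illustration only, not the benchmark harness):

```python
def rtfx(audio_seconds: float, wall_seconds: float) -> float:
    """Real-time factor: seconds of audio processed per wall-clock second."""
    return audio_seconds / wall_seconds

hour = 3600.0
int4_time = hour / 98   # ~36.7 s to transcribe one hour at 98x
bf16_time = hour / 73   # ~49.3 s at 73x
speedup = bf16_time / int4_time  # ~1.34, the +34% in the table
```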
The small WER increase on TED-LIUM (+0.4pp) appears on speakers with challenging acoustics. LibriSpeech (clean speech) shows no degradation.
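For reference, WER is word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal implementation, for intuition only (not the evaluation harness behind the numbers above):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[j] = edit distance between ref[:i] and hyp[:j]
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                            # deletion
                        dp[j - 1] + 1,                        # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))    # sub / match
            prev = cur
    return dp[-1] / len(ref)
```

A +0.4pp change thus corresponds to roughly four extra word errors per thousand reference words.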
Recommendation: Use INT8 for general use. Use INT4 when memory is the constraint (8 GB unified memory Macs).
## Quantization Details
Only the Conformer encoder (~85% of parameters) is quantized to INT4 (group_size=64). The decoder and joint network remain in BF16, preserving precision for token generation.
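This split can be expressed through the `class_predicate` hook of `mlx.nn.quantize`, which decides per module whether to quantize. A sketch (the `encoder.` path prefix and the stand-in modules below are assumptions for illustration; real MLX layers such as `Linear` expose `to_quantized`):

```python
class FakeLinear:
    """Stand-in for a quantizable MLX layer (exposes to_quantized)."""
    def to_quantized(self, group_size=64, bits=4):
        pass

class FakeNorm:
    """Stand-in for e.g. LayerNorm, which has no quantized form."""

def quantize_encoder_only(path: str, module) -> bool:
    """class_predicate: apply INT4 only inside the encoder, and only to
    modules that support quantization; decoder/joint stay in BF16."""
    return path.startswith("encoder") and hasattr(module, "to_quantized")

# With MLX installed, this predicate would be applied as:
#   import mlx.nn as nn
#   nn.quantize(model, group_size=64, bits=4,
#               class_predicate=quantize_encoder_only)
```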
## Usage

Install: `pip install parakeet-mlx`

```python
from parakeet import from_pretrained
import mlx.core as mx

model = from_pretrained("sonic-speech/parakeet-tdt-0.6b-v3-int4", dtype=mx.bfloat16)
result = model.transcribe("audio.wav")
```
## Origin

Quantized from `sonic-speech/parakeet-tdt-0.6b-v3` using `mlx.nn.quantize` (encoder-only, 4-bit, group_size=64).
Part of the Sonic Speech model collection.
Base model: `nvidia/parakeet-tdt-0.6b-v3`