# Parakeet TDT 0.6B v3 (MLX, Encoder INT4)

NVIDIA Parakeet TDT v3 with encoder-only INT4 quantization: the lite variant for memory-constrained devices (8 GB Macs). 61% smaller than BF16 with only +0.4pp WER on real-world speech.
See also:

- `sonic-speech/parakeet-tdt-0.6b-v3`: full BF16 reference
- `sonic-speech/parakeet-tdt-0.6b-v3-int8`: encoder INT8 (recommended)
## Performance
| Metric | BF16 | INT4 | Change |
|---|---|---|---|
| WER (LibriSpeech) | 0.82% | 0.82% | None |
| WER (TED-LIUM) | 15.1% | 15.5% | +0.4pp |
| RTFx | 73x | 98x | +34% |
| Peak Memory | 3,002 MB | 1,003 MB | -67% |
| Weight Size | 1,254 MB | 489 MB | -61% |
Benchmarked on Apple M3 Max (64 GB), macOS Sequoia 15.7.3, MLX 0.30.4.
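RTFx is the real-time factor: seconds of audio processed per second of wall-clock time, so 98x means an hour of audio transcribes in roughly 37 seconds. A small arithmetic sketch using the table's numbers (illustration only, not the benchmark harness):

```python
def rtfx(audio_seconds: float, wall_seconds: float) -> float:
    """Real-time factor: seconds of audio processed per wall-clock second."""
    return audio_seconds / wall_seconds

hour = 3600.0
int4_time = hour / 98   # ~36.7 s to transcribe one hour at 98x
bf16_time = hour / 73   # ~49.3 s at 73x
speedup = bf16_time / int4_time  # ~1.34, the +34% in the table
```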
The small WER increase on TED-LIUM (+0.4pp) appears on speakers with challenging acoustics. LibriSpeech (clean speech) shows no degradation.
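For reference, WER is word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal implementation, for intuition only (not the evaluation harness behind the numbers above):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[j] = edit distance between ref[:i] and hyp[:j]
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                            # deletion
                        dp[j - 1] + 1,                        # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))    # sub / match
            prev = cur
    return dp[-1] / len(ref)
```

A +0.4pp change thus corresponds to roughly four extra word errors per thousand reference words.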
Recommendation: Use INT8 for general use. Use INT4 when memory is the constraint (8 GB unified memory Macs).
## Quantization Details
Only the Conformer encoder (~85% of parameters) is quantized to INT4 (group_size=64). The decoder and joint network remain in BF16, preserving precision for token generation.
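This split can be expressed through the `class_predicate` hook of `mlx.nn.quantize`, which decides per module whether to quantize. A sketch (the `encoder.` path prefix and the stand-in modules below are assumptions for illustration; real MLX layers such as `Linear` expose `to_quantized`):

```python
class FakeLinear:
    """Stand-in for a quantizable MLX layer (exposes to_quantized)."""
    def to_quantized(self, group_size=64, bits=4):
        pass

class FakeNorm:
    """Stand-in for e.g. LayerNorm, which has no quantized form."""

def quantize_encoder_only(path: str, module) -> bool:
    """class_predicate: apply INT4 only inside the encoder, and only to
    modules that support quantization; decoder/joint stay in BF16."""
    return path.startswith("encoder") and hasattr(module, "to_quantized")

# With MLX installed, this predicate would be applied as:
#   import mlx.nn as nn
#   nn.quantize(model, group_size=64, bits=4,
#               class_predicate=quantize_encoder_only)
```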
## Usage

Install: `pip install parakeet-mlx`

```python
from parakeet import from_pretrained
import mlx.core as mx

model = from_pretrained("sonic-speech/parakeet-tdt-0.6b-v3-int4", dtype=mx.bfloat16)
result = model.transcribe("audio.wav")
```
## Origin

Quantized from `sonic-speech/parakeet-tdt-0.6b-v3` using `mlx.nn.quantize` (encoder-only, 4-bit, group_size=64).
Part of the Sonic Speech model collection.
Base model: `nvidia/parakeet-tdt-0.6b-v3`