# Supertonic Quantized INT8: Offline TTS (Shadow0482)
This repository contains INT8-optimized ONNX models for the Supertonic text-to-speech pipeline. They are quantized versions of the official Supertonic models, designed for offline, low-latency, CPU-friendly inference.
FP16 versions exist for experimentation, but the vocoder currently contains a float32/float16 type mismatch in a Div node, so FP16 inference is not stable.
Therefore, INT8 is the recommended format for real-world offline use.
## 🚀 Features
✅ **100% Offline Execution**
No network access needed. Load the ONNX models directly with ONNX Runtime.
✅ **Full Supertonic Inference Stack**
- Text Encoder
- Duration Predictor
- Vector Estimator
- Vocoder
✅ **INT8 Dynamic Quantization**
- Reduces model sizes dramatically
- CPU-friendly inference
- Very low memory usage
- Compatible with ONNX Runtime CPUExecutionProvider
✅ **Comparable Audio Quality**
Produces intelligible speech close to the FP32 output while running drastically faster on CPUs.
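The size reduction comes from storing weights as 8-bit integers plus a per-tensor scale. As a rough NumPy illustration of the idea (symmetric quantization; not ONNX Runtime's exact algorithm):

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor INT8 quantization: w ~= q * scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, float(scale)

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print("bytes fp32:", w.nbytes)   # 262144
print("bytes int8:", q.nbytes)   # 65536 (4x smaller)
print("max abs error:", np.abs(w - w_hat).max())
```

Dynamic quantization keeps activations in float32 and quantizes them on the fly at runtime, which is why no calibration dataset is needed.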
## 📦 Repository Structure
```
int8_dynamic/
  duration_predictor.int8.onnx
  text_encoder.int8.onnx
  vector_estimator.int8.onnx
  vocoder.int8.onnx
fp16/
  (experimental FP16 models - vocoder currently unstable)
```
Only the INT8 directory is guaranteed stable.
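Before running inference it can be worth sanity-checking that all four INT8 models are present. A small hypothetical helper (not shipped with the repo):

```python
from pathlib import Path

def missing_models(model_dir: Path) -> list[str]:
    """Return the expected INT8 model files that are not present."""
    expected = [
        "text_encoder.int8.onnx",
        "duration_predictor.int8.onnx",
        "vector_estimator.int8.onnx",
        "vocoder.int8.onnx",
    ]
    return [name for name in expected if not (model_dir / name).is_file()]

missing = missing_models(Path("int8_dynamic"))
print("missing:", missing)  # empty list when the repo layout is intact
```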
## 📝 Test Sentence Used in Benchmark
> Greetings! You are listening to your newly quantized model.
> I have been squished, squeezed, compressed, minimized, optimized,
> digitized, and lightly traumatized to save disk space.
> The testing framework automatically verifies my integrity,
> measures how much weight I lost,
> and checks if I can still talk without glitching into a robot dolphin.
> If you can hear this clearly, the quantization ritual was a complete success.
## 📊 Benchmark Summary (CPU)
| Model | Precision | Time (s) | Output | Status |
|---|---|---|---|---|
| INT8 Dynamic | int8 | ~3.0–7.0 (varies) | *.wav | ✅ OK |
| FP32 (baseline) | float32 | ~2–4× slower | *.wav | ✅ OK |
| FP16 | mixed | FAILED | n/a | 🚫 Cannot load vocoder |
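Timings like these depend heavily on hardware, so it is worth re-measuring locally. A minimal timing harness sketch (the workload below is a stand-in; substitute the full pipeline call from the script further down):

```python
import time

def benchmark(fn, warmup: int = 1, runs: int = 5) -> float:
    """Return the median wall-clock time of fn() over several runs."""
    for _ in range(warmup):
        fn()  # warm caches / lazy initialization before measuring
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    times.sort()
    return times[len(times) // 2]

# Stand-in workload; replace with the INT8 TTS pipeline call.
median_s = benchmark(lambda: sum(i * i for i in range(100_000)))
print(f"median: {median_s:.4f}s")
```

The median is reported rather than the mean so a single slow outlier run does not skew the result.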
## 🖥️ Offline Inference Guide (Python)
Below is a clean Python script to run fully offline INT8 inference.
### 🧩 Requirements

```bash
pip install onnxruntime numpy soundfile
```
### 📄 offline_tts_int8.py
```python
import json
from pathlib import Path

import numpy as np
import onnxruntime as ort
import soundfile as sf

# ---------------------------------------------------------
# 1) CONFIG
# ---------------------------------------------------------
MODEL_DIR = Path("int8_dynamic")  # folder containing *.int8.onnx
VOICE_STYLE = "assets/voice_styles/M1.json"

text_encoder_path = MODEL_DIR / "text_encoder.int8.onnx"
duration_pred_path = MODEL_DIR / "duration_predictor.int8.onnx"
vector_estimator_path = MODEL_DIR / "vector_estimator.int8.onnx"
vocoder_path = MODEL_DIR / "vocoder.int8.onnx"

TEST_TEXT = (
    "Hello! This is the INT8 offline version of Supertonic speaking. "
    "Everything you hear right now is running fully offline."
)

# ---------------------------------------------------------
# 2) TOKENIZER LOADING
# ---------------------------------------------------------
unicode_path = Path("assets/onnx/unicode_indexer.json")
with open(unicode_path, encoding="utf-8") as f:
    tokenizer = json.load(f)

def encode_text(text: str) -> np.ndarray:
    """Map each character to its token id, falling back to <unk>."""
    token2idx = tokenizer["token2idx"]
    unk = token2idx["<unk>"]
    ids = [token2idx.get(ch, unk) for ch in text]
    return np.array([ids], dtype=np.int64)

# ---------------------------------------------------------
# 3) LOAD MODELS (CPU)
# ---------------------------------------------------------
def load_session(model_path: Path) -> ort.InferenceSession:
    return ort.InferenceSession(
        str(model_path),
        providers=["CPUExecutionProvider"],
    )

sess_text = load_session(text_encoder_path)
sess_dur = load_session(duration_pred_path)
sess_vec = load_session(vector_estimator_path)
sess_voc = load_session(vocoder_path)

# ---------------------------------------------------------
# 4) RUN TEXT ENCODER
# ---------------------------------------------------------
text_ids = encode_text(TEST_TEXT)
text_mask = np.ones((1, 1, text_ids.shape[1]), dtype=np.float32)
style_ttl = np.zeros((1, 50, 256), dtype=np.float32)

text_out = sess_text.run(
    None,
    {
        "text_ids": text_ids,
        "text_mask": text_mask,
        "style_ttl": style_ttl,
    },
)[0]

# ---------------------------------------------------------
# 5) RUN DURATION PREDICTOR
# ---------------------------------------------------------
style_dp = np.zeros((1, 8, 16), dtype=np.float32)

dur_out = sess_dur.run(
    None,
    {
        "text_ids": text_ids,
        "text_mask": text_mask,
        "style_dp": style_dp,
    },
)[0]

# Round, and force at least one frame per token.
durations = np.maximum(np.round(dur_out).astype(np.int64), 1)

# ---------------------------------------------------------
# 6) VECTOR ESTIMATOR
# ---------------------------------------------------------
latent = sess_vec.run(None, {"latent": text_out})[0]

# ---------------------------------------------------------
# 7) VOCODER -> WAV
# ---------------------------------------------------------
wav = sess_voc.run(None, {"latent": latent})[0][0]

sf.write("output_int8.wav", wav, 24000)
print("Saved: output_int8.wav")
```
## 🎧 Output
After running:

```bash
python offline_tts_int8.py
```

you will get `output_int8.wav`, playable offline on any system.
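To confirm the generated file is well-formed without any extra dependencies, the stdlib `wave` module can read its header. A small hypothetical check (the 24000 Hz rate matches what the script writes):

```python
import wave
from pathlib import Path

def wav_info(path: str) -> dict:
    """Read basic header fields from a PCM WAV file."""
    with wave.open(path, "rb") as w:
        frames = w.getnframes()
        rate = w.getframerate()
        return {
            "sample_rate": rate,
            "channels": w.getnchannels(),
            "duration_s": frames / rate,
        }

if Path("output_int8.wav").is_file():
    print(wav_info("output_int8.wav"))  # expect sample_rate == 24000
```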
## 📌 Notes
- Only the INT8 models are stable and recommended.
- The FP16 vocoder currently fails due to a float32/float16 type mismatch in a Div node.
- No internet connection is required for INT8 inference.
- These models are well suited to embedded or low-spec machines.
## 📜 License
These models follow Supertone's licensing terms; the quantized versions inherit the same license.
## Model tree for Shadow0482/supertonic-quantized

Base model: Supertone/supertonic