# Supertonic Quantized INT8: Offline TTS (Shadow0482)
This repository contains INT8-optimized ONNX models for the Supertonic text-to-speech pipeline. They are quantized versions of the official Supertonic models, designed for offline, low-latency, CPU-friendly inference.
FP16 versions exist for experimentation, but the vocoder currently contains a float32/float16 type mismatch in a Div node, so FP16 inference is not stable.
Therefore, INT8 is the recommended format for real-world offline use.
## 🚀 Features
✅ **100% Offline Execution**
No network access needed. Load the ONNX models directly with ONNX Runtime.
✅ **Full Supertonic Inference Stack**
- Text Encoder
- Duration Predictor
- Vector Estimator
- Vocoder
✅ **INT8 Dynamic Quantization**
- Reduces model sizes dramatically
- CPU-friendly inference
- Very low memory usage
- Compatible with ONNX Runtime CPUExecutionProvider
✅ **Comparable Audio Quality**
Produces intelligible speech close to the FP32 output while running drastically faster on CPUs.
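The size reduction comes from storing weights as 8-bit integers plus a per-tensor scale. As a rough NumPy illustration of the idea (symmetric quantization; not ONNX Runtime's exact algorithm):

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor INT8 quantization: w ~= q * scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, float(scale)

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print("bytes fp32:", w.nbytes)   # 262144
print("bytes int8:", q.nbytes)   # 65536 (4x smaller)
print("max abs error:", np.abs(w - w_hat).max())
```

Dynamic quantization keeps activations in float32 and quantizes them on the fly at runtime, which is why no calibration dataset is needed.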
## 📦 Repository Structure
```
int8_dynamic/
  duration_predictor.int8.onnx
  text_encoder.int8.onnx
  vector_estimator.int8.onnx
  vocoder.int8.onnx
fp16/
  (experimental FP16 models - vocoder currently unstable)
```
Only the INT8 directory is guaranteed stable.
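Before running inference it can be worth sanity-checking that all four INT8 models are present. A small hypothetical helper (not shipped with the repo):

```python
from pathlib import Path

def missing_models(model_dir: Path) -> list[str]:
    """Return the expected INT8 model files that are not present."""
    expected = [
        "text_encoder.int8.onnx",
        "duration_predictor.int8.onnx",
        "vector_estimator.int8.onnx",
        "vocoder.int8.onnx",
    ]
    return [name for name in expected if not (model_dir / name).is_file()]

missing = missing_models(Path("int8_dynamic"))
print("missing:", missing)  # empty list when the repo layout is intact
```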
## 📝 Test Sentence Used in Benchmark
> Greetings! You are listening to your newly quantized model.
> I have been squished, squeezed, compressed, minimized, optimized,
> digitized, and lightly traumatized to save disk space.
> The testing framework automatically verifies my integrity,
> measures how much weight I lost,
> and checks if I can still talk without glitching into a robot dolphin.
> If you can hear this clearly, the quantization ritual was a complete success.
## 📊 Benchmark Summary (CPU)
| Model | Precision | Time (s) | Output | Status |
|---|---|---|---|---|
| INT8 Dynamic | int8 | ~3.0–7.0 (varies) | *.wav | ✅ OK |
| FP32 (baseline) | float32 | ~2–4× slower | *.wav | ✅ OK |
| FP16 | mixed | FAILED | n/a | 🚫 Cannot load vocoder |
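Timings like these depend heavily on hardware, so it is worth re-measuring locally. A minimal timing harness sketch (the workload below is a stand-in; substitute the full pipeline call from the script further down):

```python
import time

def benchmark(fn, warmup: int = 1, runs: int = 5) -> float:
    """Return the median wall-clock time of fn() over several runs."""
    for _ in range(warmup):
        fn()  # warm caches / lazy initialization before measuring
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    times.sort()
    return times[len(times) // 2]

# Stand-in workload; replace with the INT8 TTS pipeline call.
median_s = benchmark(lambda: sum(i * i for i in range(100_000)))
print(f"median: {median_s:.4f}s")
```

The median is reported rather than the mean so a single slow outlier run does not skew the result.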
## 🖥️ Offline Inference Guide (Python)
Below is a clean Python script to run fully offline INT8 inference.
### 🧩 Requirements

```bash
pip install onnxruntime numpy soundfile
```
### 📄 offline_tts_int8.py
```python
import json
from pathlib import Path

import numpy as np
import onnxruntime as ort
import soundfile as sf

# ---------------------------------------------------------
# 1) CONFIG
# ---------------------------------------------------------
MODEL_DIR = Path("int8_dynamic")  # folder containing *.int8.onnx
VOICE_STYLE = "assets/voice_styles/M1.json"

text_encoder_path = MODEL_DIR / "text_encoder.int8.onnx"
duration_pred_path = MODEL_DIR / "duration_predictor.int8.onnx"
vector_estimator_path = MODEL_DIR / "vector_estimator.int8.onnx"
vocoder_path = MODEL_DIR / "vocoder.int8.onnx"

TEST_TEXT = (
    "Hello! This is the INT8 offline version of Supertonic speaking. "
    "Everything you hear right now is running fully offline."
)

# ---------------------------------------------------------
# 2) TOKENIZER LOADING
# ---------------------------------------------------------
unicode_path = Path("assets/onnx/unicode_indexer.json")
with open(unicode_path, encoding="utf-8") as f:
    tokenizer = json.load(f)

def encode_text(text: str) -> np.ndarray:
    """Map each character to its token id, falling back to <unk>."""
    token2idx = tokenizer["token2idx"]
    unk = token2idx["<unk>"]
    ids = [token2idx.get(ch, unk) for ch in text]
    return np.array([ids], dtype=np.int64)

# ---------------------------------------------------------
# 3) LOAD MODELS (CPU)
# ---------------------------------------------------------
def load_session(model_path: Path) -> ort.InferenceSession:
    return ort.InferenceSession(
        str(model_path),
        providers=["CPUExecutionProvider"],
    )

sess_text = load_session(text_encoder_path)
sess_dur = load_session(duration_pred_path)
sess_vec = load_session(vector_estimator_path)
sess_voc = load_session(vocoder_path)

# ---------------------------------------------------------
# 4) RUN TEXT ENCODER
# ---------------------------------------------------------
text_ids = encode_text(TEST_TEXT)
text_mask = np.ones((1, 1, text_ids.shape[1]), dtype=np.float32)
style_ttl = np.zeros((1, 50, 256), dtype=np.float32)

text_out = sess_text.run(
    None,
    {
        "text_ids": text_ids,
        "text_mask": text_mask,
        "style_ttl": style_ttl,
    },
)[0]

# ---------------------------------------------------------
# 5) RUN DURATION PREDICTOR
# ---------------------------------------------------------
style_dp = np.zeros((1, 8, 16), dtype=np.float32)

dur_out = sess_dur.run(
    None,
    {
        "text_ids": text_ids,
        "text_mask": text_mask,
        "style_dp": style_dp,
    },
)[0]

# Round, and force at least one frame per token.
durations = np.maximum(np.round(dur_out).astype(np.int64), 1)

# ---------------------------------------------------------
# 6) VECTOR ESTIMATOR
# ---------------------------------------------------------
latent = sess_vec.run(None, {"latent": text_out})[0]

# ---------------------------------------------------------
# 7) VOCODER -> WAV
# ---------------------------------------------------------
wav = sess_voc.run(None, {"latent": latent})[0][0]

sf.write("output_int8.wav", wav, 24000)
print("Saved: output_int8.wav")
```
## 🎧 Output
After running:

```bash
python offline_tts_int8.py
```

you will get `output_int8.wav`, playable offline on any system.
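To confirm the generated file is well-formed without any extra dependencies, the stdlib `wave` module can read its header. A small hypothetical check (the 24000 Hz rate matches what the script writes):

```python
import wave
from pathlib import Path

def wav_info(path: str) -> dict:
    """Read basic header fields from a PCM WAV file."""
    with wave.open(path, "rb") as w:
        frames = w.getnframes()
        rate = w.getframerate()
        return {
            "sample_rate": rate,
            "channels": w.getnchannels(),
            "duration_s": frames / rate,
        }

if Path("output_int8.wav").is_file():
    print(wav_info("output_int8.wav"))  # expect sample_rate == 24000
```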
## 📌 Notes
- Only the INT8 models are stable and recommended.
- The FP16 vocoder currently fails due to a float32/float16 type mismatch in a Div node.
- No internet connection is required for INT8 inference.
- These models are well suited to embedded or low-spec machines.
## 📜 License
These models follow Supertone's licensing terms; the quantized versions inherit the same license.
## Model tree for Shadow0482/supertonic-quantized

Base model: Supertone/supertonic