Supertonic Quantized INT8: Offline TTS (Shadow0482)

This repository contains INT8-quantized ONNX models for the Supertonic text-to-speech (TTS) pipeline. They are quantized versions of the official Supertonic models, built for offline, low-latency, CPU-friendly inference.

FP16 versions exist for experimentation, but the FP16 vocoder currently contains a float32/float16 type mismatch in a Div node, so FP16 inference is not stable.
INT8 is therefore the recommended format for real-world offline use.
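
If you want to confirm the mismatch yourself, the onnx package can scan the graph for Div nodes with mixed input dtypes. This is a minimal sketch; the fp16/vocoder.fp16.onnx file name is an assumption, so adjust it to match your copy:

# Sketch: find Div nodes whose inputs mix float32 and float16.
# The file name below is an assumption; adjust to your local copy.
import onnx

model = onnx.shape_inference.infer_shapes(onnx.load("fp16/vocoder.fp16.onnx"))
graph = model.graph

# Map tensor names to element types (graph I/O, inferred values, initializers).
elem_type = {}
for vi in list(graph.input) + list(graph.output) + list(graph.value_info):
    elem_type[vi.name] = vi.type.tensor_type.elem_type
for init in graph.initializer:
    elem_type[init.name] = init.data_type

for node in graph.node:
    if node.op_type == "Div":
        types = {elem_type.get(name) for name in node.input}
        if len(types - {None}) > 1:  # mixed dtypes, e.g. FLOAT vs FLOAT16
            names = [onnx.TensorProto.DataType.Name(t) for t in sorted(types - {None})]
            print(node.name or "<unnamed Div>", names)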


🚀 Features

✔ 100% Offline Execution

No network needed. Load ONNX models directly using ONNX Runtime.

✔ Full Supertonic Inference Stack

  • Text Encoder
  • Duration Predictor
  • Vector Estimator
  • Vocoder

✔ INT8 Dynamic Quantization

  • Reduces model sizes dramatically
  • CPU-friendly inference
  • Very low memory usage
  • Compatible with the ONNX Runtime CPUExecutionProvider
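
For reference, this kind of dynamic INT8 quantization is typically produced with ONNX Runtime's quantization tooling. The sketch below shows the general shape of the call; the exact settings used for these models are an assumption:

# Sketch: dynamically quantize an FP32 ONNX model to INT8.
# The settings used for this repository's models are an assumption.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="text_encoder.onnx",        # original FP32 model
    model_output="text_encoder.int8.onnx",  # quantized output
    weight_type=QuantType.QInt8,            # INT8 weights, dynamic activations
)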

✔ Comparable Audio Quality

Produces clearly intelligible speech while running drastically faster on CPU.


📦 Repository Structure


int8_dynamic/
  duration_predictor.int8.onnx
  text_encoder.int8.onnx
  vector_estimator.int8.onnx
  vocoder.int8.onnx

fp16/
  (experimental FP16 models; the vocoder is currently unstable)

Only the int8_dynamic/ directory is guaranteed stable.


🔊 Test Sentence Used in Benchmark


Greetings! You are listening to your newly quantized model.
I have been squished, squeezed, compressed, minimized, optimized,
digitized, and lightly traumatized to save disk space.
The testing framework automatically verifies my integrity,
measures how much weight I lost,
and checks if I can still talk without glitching into a robot dolphin.
If you can hear this clearly, the quantization ritual was a complete success.

📈 Benchmark Summary (CPU)

Model            Precision  Time (s)      Output  Status
INT8 Dynamic     int8       ~3.0–7.0      *.wav   ✅ OK
FP32 (baseline)  float32    ~2–4× slower  *.wav   ✅ OK
FP16             mixed      ❌ FAILED     n/a     🚫 Cannot load vocoder
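
Timings vary with CPU and input length. To take your own measurement, a simple wall-clock helper around a single session run is enough; this is an illustrative sketch, not the harness behind the table above:

# Sketch: wall-clock timing of one ONNX Runtime inference call (illustrative).
import time
import onnxruntime as ort

def time_one_run(model_path: str, feeds: dict) -> float:
    """Load a model on CPU and return the seconds one run() call takes."""
    sess = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    start = time.perf_counter()
    sess.run(None, feeds)
    return time.perf_counter() - start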

🖥️ Offline Inference Guide (Python)

Below is a clean Python script to run fully offline INT8 inference.


🧩 Requirements


pip install onnxruntime numpy soundfile

📜 offline_tts_int8.py

import onnxruntime as ort
import numpy as np
import json
import soundfile as sf
from pathlib import Path

# ---------------------------------------------------------
# 1) CONFIG
# ---------------------------------------------------------
MODEL_DIR = Path("int8_dynamic")   # folder containing *.int8.onnx
VOICE_STYLE = "assets/voice_styles/M1.json"  # reserved; unused in this minimal example

text_encoder_path      = MODEL_DIR / "text_encoder.int8.onnx"
duration_pred_path     = MODEL_DIR / "duration_predictor.int8.onnx"
vector_estimator_path  = MODEL_DIR / "vector_estimator.int8.onnx"
vocoder_path           = MODEL_DIR / "vocoder.int8.onnx"

TEST_TEXT = (
    "Hello! This is the INT8 offline version of Supertonic speaking. "
    "Everything you hear right now is running fully offline."
)

# ---------------------------------------------------------
# 2) TOKENIZER LOADING
# ---------------------------------------------------------
unicode_path = Path("assets/onnx/unicode_indexer.json")
with open(unicode_path, encoding="utf-8") as f:
    tokenizer = json.load(f)

def encode_text(text: str) -> np.ndarray:
    """Map each character to its token id, falling back to <unk>."""
    token2idx = tokenizer["token2idx"]
    ids = [token2idx.get(ch, token2idx["<unk>"]) for ch in text]
    return np.array([ids], dtype=np.int64)

# ---------------------------------------------------------
# 3) LOAD MODELS (CPU)
# ---------------------------------------------------------
def load_session(model_path):
    return ort.InferenceSession(
        str(model_path),
        providers=["CPUExecutionProvider"]
    )

sess_text = load_session(text_encoder_path)
sess_dur  = load_session(duration_pred_path)
sess_vec  = load_session(vector_estimator_path)
sess_voc  = load_session(vocoder_path)

# ---------------------------------------------------------
# 4) RUN TEXT ENCODER
# ---------------------------------------------------------
text_ids = encode_text(TEST_TEXT)
text_mask = np.ones((1, 1, text_ids.shape[1]), dtype=np.float32)
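# Zero style embeddings keep this example self-contained; a real run would
# load the voice style referenced by VOICE_STYLE above (here and in step 5).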
style_ttl = np.zeros((1, 50, 256), dtype=np.float32)

text_out = sess_text.run(
    None,
    {
        "text_ids": text_ids,
        "text_mask": text_mask,
        "style_ttl": style_ttl
    }
)[0]

# ---------------------------------------------------------
# 5) RUN DURATION PREDICTOR
# ---------------------------------------------------------
style_dp = np.zeros((1, 8, 16), dtype=np.float32)

dur_out = sess_dur.run(
    None,
    {
        "text_ids": text_ids,
        "text_mask": text_mask,
        "style_dp": style_dp
    }
)[0]

durations = np.maximum(dur_out.astype(int), 1)
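# NOTE: this minimal example computes durations but never consumes them; a
# complete pipeline would typically use them to expand text-level frames to
# audio-level frames before the vector estimator.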

# ---------------------------------------------------------
# 6) VECTOR ESTIMATOR
# ---------------------------------------------------------
latent = sess_vec.run(None, {"latent": text_out})[0]

# ---------------------------------------------------------
# 7) VOCODER β†’ WAV
# ---------------------------------------------------------
wav = sess_voc.run(None, {"latent": latent})[0][0]

sf.write("output_int8.wav", wav, 24000)
print("Saved: output_int8.wav")

🎧 Output

After running:

python offline_tts_int8.py

You will get:

output_int8.wav

A 24 kHz WAV file, playable offline on any system.
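
To sanity-check the result, the file can be read back with soundfile (a quick verification step, not part of the pipeline):

# Sketch: read the generated WAV back and report its length and sample rate.
import soundfile as sf

data, sr = sf.read("output_int8.wav")
print(f"{len(data) / sr:.2f}s of audio at {sr} Hz")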


πŸ“ Notes

  • Only the INT8 models are stable and recommended.
  • The FP16 vocoder currently fails due to a type mismatch in a Div node.
  • No internet connection is required for INT8 inference.
  • These models are ideal for embedded or low-spec machines (see the thread-capping sketch below).
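
For embedded targets, ONNX Runtime's CPU thread usage can additionally be capped through SessionOptions; the thread counts below are illustrative:

# Sketch: cap ONNX Runtime CPU threads on low-spec machines (illustrative values).
import onnxruntime as ort

opts = ort.SessionOptions()
opts.intra_op_num_threads = 2  # threads used inside a single operator
opts.inter_op_num_threads = 1  # threads used to run operators in parallel
sess = ort.InferenceSession(
    "int8_dynamic/vocoder.int8.onnx",
    sess_options=opts,
    providers=["CPUExecutionProvider"],
)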

📄 License

Models follow Supertone's licensing terms. The quantized versions follow the same licensing.
