VibeVoice-ASR ONNX

ONNX exports of microsoft/VibeVoice-ASR for browser-based speech recognition via Transformers.js v4 and WebGPU.

Model Architecture

VibeVoice-ASR is a composite speech recognition model:

| Component | Base Architecture | Purpose |
|---|---|---|
| Speech Encoder | Custom (ConvNeXt-style VAE + Transformer) | Encodes 24 kHz audio waveform to speech embeddings |
| Decoder (LM) | Qwen2-7B (28 layers, 28 heads, 4 KV heads) | Autoregressive text generation from speech + text embeddings |

For ASR, the pipeline is:

  1. Speech Encoder (encoder_model): processes raw 24 kHz audio through the acoustic/semantic tokenizer encoders and connector projections to produce speech embeddings of shape (1, T, 3584)
  2. Decoder (decoder_model_merged): the Qwen2 LM takes the concatenated [speech_embeddings, text_embeddings] sequence and autoregressively generates the transcription text
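The hand-off between the two stages can be sketched with plain NumPy shapes. This is only a shape sketch: the sequence length T (150 here) and the prompt length are illustrative assumptions, and in the real model the Qwen2 embedding table supplies the text embeddings.

```python
import numpy as np

# Hypothetical shapes for a short clip; T depends on the audio length.
speech_embeddings = np.zeros((1, 150, 3584), dtype=np.float32)  # (1, T, hidden)
text_embeddings = np.zeros((1, 32, 3584), dtype=np.float32)     # embedded prompt tokens

# The decoder consumes the two sequences concatenated along the time axis.
decoder_input = np.concatenate([speech_embeddings, text_embeddings], axis=1)
print(decoder_input.shape)  # (1, 182, 3584)
```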

ONNX Files

The model is exported as two ONNX subgraphs wrapped for Transformers.js WhisperForConditionalGeneration:

| File | Inputs | Outputs | Notes |
|---|---|---|---|
| encoder_model_fp16.onnx (+ data shards) | audio (1, 1, samples) | speech_embeddings (1, T, 3584) | FP16, ~2.9 GB |
| decoder_model_merged_{dtype}.onnx (+ data shards) | input_ids, speech_embeddings | logits | Quantized (see below) |

Quantization Options

| DType | Decoder Size | Total Download | Decoder Shard Count |
|---|---|---|---|
| int8 | ~6 GB | ~9 GB | 5 |
| q4 | ~2.5 GB | ~5.4 GB | 3 |

External data shards use the naming convention: {model}_{dtype}.onnx_data, _data_1, _data_2, etc. All shards are kept under 1.9 GB for browser ArrayBuffer compatibility.
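A small sketch of that naming convention; the helper name is ours, not part of the export tooling, and shard counts come from the quantization table above.

```python
def shard_files(model: str, dtype: str, num_shards: int) -> list[str]:
    """Return the ONNX graph file plus its external-data shard names.

    First shard is "<base>.onnx_data"; later shards append "_1", "_2", ...
    """
    base = f"{model}_{dtype}.onnx"
    shards = [f"{base}_data"] + [f"{base}_data_{i}" for i in range(1, num_shards)]
    return [base] + shards

# e.g. the Q4 decoder ships one graph file and three data shards:
print(shard_files("decoder_model_merged", "q4", 3))
```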

Input Details

  • Sample rate: 24,000 Hz (24kHz mono)
  • Audio format: Raw waveform, float32
  • Tokenizer: Qwen2 (152,064 vocab size)
  • Special tokens: BOS=151643, EOS=151645
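As a sketch of the input contract above, a hypothetical NumPy helper that downmixes to mono, resamples to 24 kHz (naive linear interpolation for illustration; a proper resampler such as librosa or soxr is preferable in practice), and shapes the waveform for the encoder:

```python
import numpy as np

TARGET_SR = 24_000  # encoder expects 24 kHz mono float32

def prepare_audio(samples: np.ndarray, sr: int) -> np.ndarray:
    """Return a float32 waveform shaped (1, 1, num_samples) at 24 kHz."""
    if samples.ndim == 2:                      # (N, channels) -> mono
        samples = samples.mean(axis=1)
    samples = samples.astype(np.float32)
    if sr != TARGET_SR:                        # naive linear-interpolation resample
        n_out = int(round(len(samples) * TARGET_SR / sr))
        x_old = np.linspace(0.0, 1.0, num=len(samples), endpoint=False)
        x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
        samples = np.interp(x_new, x_old, samples).astype(np.float32)
    return samples[np.newaxis, np.newaxis, :]
```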

Usage

Browser (Transformers.js v4 + WebGPU)

This model is designed to be loaded with WhisperForConditionalGeneration.from_pretrained() from Transformers.js v4, which handles downloading, sharding, and WebGPU session creation.

import { AutoTokenizer, WhisperForConditionalGeneration } from "@huggingface/transformers";
import * as ort from "onnxruntime-web"; // provides ort.Tensor for the raw session calls below

const MODEL_ID = "akkikiki/VibeVoice-ASR-onnx";

// Load tokenizer
const tokenizer = await AutoTokenizer.from_pretrained(MODEL_ID);

// Load model with WebGPU + INT8 quantization
const model = await WhisperForConditionalGeneration.from_pretrained(MODEL_ID, {
    device: "webgpu",
    dtype: {
        encoder_model: "fp16",
        decoder_model_merged: "int8",  // or "q4"
    },
    use_external_data_format: {
        encoder_model: 1,               // 1 data shard
        decoder_model_merged: 5,        // 5 data shards (INT8) or 3 (Q4)
    },
});

// Access ONNX sessions
const encoderSession = model.sessions.model;
const decoderSession = model.sessions.decoder_model_merged;

// Encode speech (24kHz raw audio)
const audioTensor = new ort.Tensor("float32", audioData, [1, 1, audioData.length]);
const { speech_embeddings } = await encoderSession.run({ audio: audioTensor });

// Prepare prompt
const prompt = `<|im_start|>system
You are a helpful assistant that transcribes audio input into text output in JSON format.<|im_end|>
<|im_start|>user
This is a 5.00 seconds audio, please transcribe it with these keys: Start time, End time, Speaker ID, Content<|im_end|>
<|im_start|>assistant
`;
const promptIds = tokenizer(prompt, { return_tensor: false }).input_ids;

// Autoregressive decoding (no KV-cache)
let currentIds = [...promptIds];
for (let step = 0; step < 128; step++) {
    const result = await decoderSession.run({
        input_ids: new ort.Tensor("int64", new BigInt64Array(currentIds.map(BigInt)), [1, currentIds.length]),
        speech_embeddings: speech_embeddings,
    });

    // Greedy argmax over last position
    const logits = result.logits;
    const vocabSize = logits.dims[2];
    const offset = (logits.dims[1] - 1) * vocabSize;
    let maxIdx = 0, maxVal = -Infinity;
    for (let i = 0; i < vocabSize; i++) {
        if (logits.data[offset + i] > maxVal) {
            maxVal = logits.data[offset + i];
            maxIdx = i;
        }
    }

    if (maxIdx === 151645) break; // EOS
    currentIds.push(maxIdx);
}

const text = tokenizer.decode(currentIds.slice(promptIds.length), { skip_special_tokens: true });
console.log(text);

Python (ONNX Runtime)

import onnxruntime as ort
import numpy as np

encoder = ort.InferenceSession("onnx/encoder_model_fp16.onnx")
decoder = ort.InferenceSession("onnx/decoder_model_merged_int8.onnx")

# audio_np: float32 waveform, 24 kHz mono, shaped (1, 1, num_samples)
speech_emb = encoder.run(None, {"audio": audio_np})[0]

# prompt_ids: int64 token ids from the Qwen2 tokenizer, shaped (1, prompt_len)
# One forward pass (greedy, no KV-cache); argmax over the last position of
# `logits` gives the next token, and decoding repeats with the extended ids.
logits = decoder.run(None, {
    "input_ids": prompt_ids,
    "speech_embeddings": speech_emb,
})[0]
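The greedy loop from the browser example translates directly. A sketch of the token-selection step (EOS id from the special-token table above); the surrounding loop is shown as comments because `decoder`, `prompt_ids`, and `speech_emb` come from the snippet above:

```python
import numpy as np

EOS_ID = 151645  # Qwen2 <|im_end|>

def pick_next_token(logits: np.ndarray) -> int:
    """Greedy argmax over the last sequence position of (1, seq, vocab) logits."""
    return int(np.argmax(logits[0, -1, :]))

# Sketch of the decode loop:
# ids = list(prompt_ids[0])
# for _ in range(128):
#     logits = decoder.run(None, {
#         "input_ids": np.array([ids], dtype=np.int64),
#         "speech_embeddings": speech_emb,
#     })[0]
#     tok = pick_next_token(logits)
#     if tok == EOS_ID:
#         break
#     ids.append(tok)
```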

Live Demo

Try it in the browser: VibeVoice-ASR WebGPU Demo

Export Tools

The ONNX models were exported using the scripts in whisper-web:

  • scripts/export_decoder_with_kvcache.py: export the decoder with KV-cache support
  • scripts/export_and_merge_q4_kvcache.py: Q4 quantization + causal-mask fixup
  • scripts/quantize_kvcache_int8.py: INT8 streaming quantization

Known Limitations

  • WebGPU session creation is slow: the decoder has ~2,300 nodes, and initial shader compilation can take several minutes on first load (cached afterward).
  • Custom encoder: the speech encoder uses a non-standard ConvNeXt-style VAE architecture, so standard transformer graph optimizations should not be applied to it.
  • Large download: INT8 is ~9 GB and Q4 is ~5.4 GB, so the initial download requires a good network connection (cached by the browser afterward).

Citation

@article{VibeVoice,
  title={VibeVoice: Unifying Speech Recognition, Synthesis, and Translation with a Composite Multimodal Model},
  author={Microsoft},
  year={2025}
}

License

This model follows the license of the original microsoft/VibeVoice-ASR.
