# VibeVoice-ASR ONNX
ONNX exports of microsoft/VibeVoice-ASR for browser-based speech recognition via Transformers.js v4 and WebGPU.
## Model Architecture
VibeVoice-ASR is a composite speech recognition model:
| Component | Base Architecture | Purpose |
|---|---|---|
| Speech Encoder | Custom (ConvNeXt-style VAE + Transformer) | Encodes 24kHz audio waveform to speech embeddings |
| Decoder (LM) | Qwen2-7B (28 layers, 28 heads, 4 KV heads) | Autoregressive text generation from speech + text embeddings |
For ASR, the pipeline is:

- **Speech Encoder** (`encoder_model`) → processes raw 24kHz audio through acoustic/semantic tokenizer encoders + connector projections to produce speech embeddings of shape `(1, T, 3584)`
- **Decoder** (`decoder_model_merged`) → the Qwen2 LM takes concatenated `[speech_embeddings, text_embeddings]` and generates transcription text autoregressively
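The concatenation step above is just a join along the sequence axis. A minimal numpy sketch of the shapes (dummy data; `T = 120` and the prompt length of 32 are made-up illustration values, while the hidden size 3584 comes from the tables in this card):

```python
import numpy as np

HIDDEN = 3584  # Qwen2-7B hidden size, matching the encoder output dimension

# Dummy stand-ins: real values come from the encoder and from embedding the prompt tokens
speech_embeddings = np.zeros((1, 120, HIDDEN), dtype=np.float32)  # (1, T, 3584) from the encoder
text_embeddings = np.zeros((1, 32, HIDDEN), dtype=np.float32)     # embedded prompt token ids

# The decoder consumes [speech_embeddings, text_embeddings] joined on the sequence axis
decoder_input = np.concatenate([speech_embeddings, text_embeddings], axis=1)
print(decoder_input.shape)  # (1, 152, 3584)
```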
## ONNX Files

The model is exported as two ONNX subgraphs wrapped for Transformers.js `WhisperForConditionalGeneration`:
| File | Inputs | Outputs | Notes |
|---|---|---|---|
| `encoder_model_fp16.onnx` (+ data shards) | `audio` `(1, 1, samples)` | `speech_embeddings` `(1, T, 3584)` | FP16, ~2.9 GB |
| `decoder_model_merged_{dtype}.onnx` (+ data shards) | `input_ids`, `speech_embeddings` | `logits` | Quantized (see below) |
## Quantization Options
| DType | Decoder Size | Total Download | Shard Count (decoder) |
|---|---|---|---|
| `int8` | ~6 GB | ~9 GB | 5 |
| `q4` | ~2.5 GB | ~5.4 GB | 3 |
External data shards use the naming convention `{model}_{dtype}.onnx_data`, `_data_1`, `_data_2`, etc. All shards are kept under 1.9 GB for browser `ArrayBuffer` compatibility.
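That convention can be made concrete with a small helper (hypothetical, not part of the export scripts) that enumerates the expected shard filenames for a model/dtype pair:

```python
def shard_names(model: str, dtype: str, count: int) -> list[str]:
    """List external-data shard filenames for a given model and dtype.

    Follows the convention above: the first shard has no numeric suffix,
    and subsequent shards append _1, _2, ...
    """
    base = f"{model}_{dtype}.onnx_data"
    return [base] + [f"{base}_{i}" for i in range(1, count)]

# The INT8 decoder ships with 5 shards:
print(shard_names("decoder_model_merged", "int8", 5))
```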
## Input Details

- Sample rate: 24,000 Hz (mono)
- Audio format: raw waveform, `float32`
- Tokenizer: Qwen2 (152,064 vocab size)
- Special tokens: BOS = `151643`, EOS = `151645`
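Audio must therefore be converted to 24kHz mono `float32` before hitting the encoder. A rough preprocessing sketch (linear interpolation is a stand-in here; a real pipeline should use a proper resampler such as soxr or librosa):

```python
import numpy as np

def prepare_audio(samples: np.ndarray, sr: int, target_sr: int = 24_000) -> np.ndarray:
    """Downmix to mono, resample, and return float32 with shape (1, 1, N)."""
    if samples.ndim == 2:  # (channels, n) -> mono by averaging channels
        samples = samples.mean(axis=0)
    samples = samples.astype(np.float32)
    if sr != target_sr:
        # Crude linear-interpolation resampling; fine for a sketch only
        n_out = int(round(len(samples) * target_sr / sr))
        x_old = np.linspace(0.0, 1.0, num=len(samples), endpoint=False)
        x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
        samples = np.interp(x_new, x_old, samples).astype(np.float32)
    return samples[None, None, :]  # (1, 1, samples), as the encoder expects

# One second of stereo 48kHz noise -> 24k mono samples
audio = prepare_audio(np.random.randn(2, 48_000), sr=48_000)
print(audio.shape, audio.dtype)  # (1, 1, 24000) float32
```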
## Usage

### Browser (Transformers.js v4 + WebGPU)

This model is designed to be loaded with `WhisperForConditionalGeneration.from_pretrained()` from Transformers.js v4, which handles downloading, sharding, and WebGPU session creation.
```js
import { AutoTokenizer, WhisperForConditionalGeneration } from "@huggingface/transformers";
// `ort` (for ort.Tensor) is the onnxruntime-web namespace that Transformers.js builds on
import * as ort from "onnxruntime-web";

const MODEL_ID = "akkikiki/VibeVoice-ASR-onnx";

// Load tokenizer
const tokenizer = await AutoTokenizer.from_pretrained(MODEL_ID);

// Load model with WebGPU + INT8 quantization
const model = await WhisperForConditionalGeneration.from_pretrained(MODEL_ID, {
  device: "webgpu",
  dtype: {
    encoder_model: "fp16",
    decoder_model_merged: "int8", // or "q4"
  },
  use_external_data_format: {
    encoder_model: 1, // 1 data shard
    decoder_model_merged: 5, // 5 data shards (INT8) or 3 (Q4)
  },
});

// Access the underlying ONNX sessions
const encoderSession = model.sessions.model;
const decoderSession = model.sessions.decoder_model_merged;

// Encode speech (24kHz mono raw audio as a Float32Array)
const audioTensor = new ort.Tensor("float32", audioData, [1, 1, audioData.length]);
const { speech_embeddings } = await encoderSession.run({ audio: audioTensor });

// Prepare prompt
const prompt = `<|im_start|>system
You are a helpful assistant that transcribes audio input into text output in JSON format.<|im_end|>
<|im_start|>user
This is a 5.00 seconds audio, please transcribe it with these keys: Start time, End time, Speaker ID, Content<|im_end|>
<|im_start|>assistant
`;
const promptIds = tokenizer(prompt, { return_tensors: false }).input_ids;

// Autoregressive greedy decoding (no KV-cache: the full sequence is re-fed each step)
let currentIds = [...promptIds];
for (let step = 0; step < 128; step++) {
  const result = await decoderSession.run({
    input_ids: new ort.Tensor("int64", new BigInt64Array(currentIds.map(BigInt)), [1, currentIds.length]),
    speech_embeddings: speech_embeddings,
  });

  // Greedy argmax over the last position
  const logits = result.logits;
  const vocabSize = logits.dims[2];
  const offset = (logits.dims[1] - 1) * vocabSize;
  let maxIdx = 0, maxVal = -Infinity;
  for (let i = 0; i < vocabSize; i++) {
    if (logits.data[offset + i] > maxVal) {
      maxVal = logits.data[offset + i];
      maxIdx = i;
    }
  }
  if (maxIdx === 151645) break; // EOS (<|im_end|>)
  currentIds.push(maxIdx);
}

const text = tokenizer.decode(currentIds.slice(promptIds.length), { skip_special_tokens: true });
console.log(text);
```
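The prompt asks the model to emit JSON with `Start time`, `End time`, `Speaker ID`, and `Content` keys. Assuming the decoded text is a JSON array of such objects (the exact output format is an assumption; inspect real model output before relying on it), post-processing can be sketched in Python as:

```python
import json

def parse_transcript(decoded: str) -> list[dict]:
    """Parse decoded model output into a list of segment dicts.

    Assumes a JSON array of objects carrying the keys requested in the
    prompt; real output may need more defensive parsing.
    """
    segments = json.loads(decoded)
    return [
        {
            "start": seg["Start time"],
            "end": seg["End time"],
            "speaker": seg["Speaker ID"],
            "text": seg["Content"],
        }
        for seg in segments
    ]

# Hand-written stand-in for model output:
decoded = '[{"Start time": "0.00", "End time": "5.00", "Speaker ID": "1", "Content": "Hello world."}]'
print(parse_transcript(decoded))
```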
### Python (ONNX Runtime)

```python
import onnxruntime as ort
import numpy as np

encoder = ort.InferenceSession("onnx/encoder_model_fp16.onnx")
decoder = ort.InferenceSession("onnx/decoder_model_merged_int8.onnx")

# Encode speech: audio_np is 24kHz mono float32 with shape (1, 1, samples)
speech_emb = encoder.run(None, {"audio": audio_np})[0]

# One decode step (greedy, no KV-cache): prompt_ids is int64 with shape (1, L)
logits = decoder.run(None, {
    "input_ids": prompt_ids,
    "speech_embeddings": speech_emb,
})[0]
next_token = int(np.argmax(logits[0, -1]))  # greedy argmax over the last position
```
## Live Demo
Try it in the browser: VibeVoice-ASR WebGPU Demo
## Export Tools

The ONNX models were exported using the scripts in whisper-web:

- `scripts/export_decoder_with_kvcache.py` → export decoder with KV-cache support
- `scripts/export_and_merge_q4_kvcache.py` → Q4 quantization + causal mask fixup
- `scripts/quantize_kvcache_int8.py` → INT8 streaming quantization
## Known Limitations

- **WebGPU session creation is slow**: the decoder has ~2,300 nodes, so initial shader compilation can take several minutes on first load (cached afterward).
- **Custom encoder**: the speech encoder uses a non-standard ConvNeXt-style VAE architecture; do not apply standard transformer graph optimizations to it.
- **Large download**: INT8 is ~9 GB, Q4 is ~5.4 GB. The initial download requires a good network connection (cached by the browser afterward).
## Citation

```bibtex
@article{VibeVoice,
  title={VibeVoice: Unifying Speech Recognition, Synthesis, and Translation with a Composite Multimodal Model},
  author={Microsoft},
  year={2025}
}
```
## License
This model follows the license of the original microsoft/VibeVoice-ASR.