# VibeVoice-7B INT8 Quantized (bitsandbytes)

Pre-quantized 8-bit version of VibeVoice-7B using bitsandbytes LLM.int8() quantization.
## Memory Usage
| Precision | VRAM | Model Size |
|---|---|---|
| Full (bf16) | ~17 GB | ~14 GB |
| INT8 (this model) | ~11.4 GB | ~11.3 GB |
| INT4 (bnb-4bit) | ~6.2 GB | ~4.2 GB |
Saves ~5.6 GB VRAM compared to full precision.
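To sanity-check the VRAM figure on your own hardware, you can measure peak allocation around the load call (a minimal sketch; the `from_pretrained` arguments match the Usage section below):

```python
import torch

from vibevoice.modular.modeling_vibevoice_inference import (
    VibeVoiceForConditionalGenerationInference,
)

torch.cuda.reset_peak_memory_stats(0)
model = VibeVoiceForConditionalGenerationInference.from_pretrained(
    "marksverdhai/vibevoice-7b-bnb-8bit",
    device_map={"": 0},
    torch_dtype=torch.bfloat16,
)
# Expect a peak in the ~11 GB range for this INT8 checkpoint.
print(f"Peak VRAM: {torch.cuda.max_memory_allocated(0) / 1024**3:.1f} GiB")
```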
## Important: Audio Components in BF16
This model uses a hybrid quantization approach:
- Language Model (Qwen2-7B backbone): INT8 quantized
- Audio Components: Remain in bfloat16 (NOT quantized)
The following modules are kept in full precision because LLM.int8() causes numerical instability in audio generation:

- `acoustic_tokenizer`: audio encoder/decoder VAE
- `semantic_tokenizer`: semantic audio encoder
- `prediction_head`: diffusion head for audio generation
- `acoustic_connector`: bridge between the LM and acoustic features
- `semantic_connector`: bridge between the LM and semantic features
- `lm_head`: text output head

This is why the model uses more VRAM than a fully quantized INT8 model would; the sketch below shows one way to verify the split.
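After loading, the language-model linears should show up as bitsandbytes `Linear8bitLt` modules, while the audio modules listed above remain ordinary bf16 layers (a sketch; assumes `model` is loaded as in the Usage section below):

```python
import bitsandbytes as bnb

# INT8 layers live in the Qwen2 backbone; none of the printed names
# should belong to the skipped audio modules listed above.
for name, module in model.named_modules():
    if isinstance(module, bnb.nn.Linear8bitLt):
        print("int8:", name)
```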
## Usage
```python
import torch

from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor

# The quantization config is embedded in the checkpoint,
# so no BitsAndBytesConfig needs to be passed here.
model = VibeVoiceForConditionalGenerationInference.from_pretrained(
    "marksverdhai/vibevoice-7b-bnb-8bit",
    device_map={"": 0},
    torch_dtype=torch.bfloat16,
)
processor = VibeVoiceProcessor.from_pretrained("marksverdhai/vibevoice-7b-bnb-8bit")

# Generate speech
model.eval()
model.set_ddpm_inference_steps(num_steps=10)

inputs = processor(
    text=["Speaker 1: Hello, this is a test of the quantized model."],
    padding=True,
    return_tensors="pt",
    return_attention_mask=True,
)
for k, v in inputs.items():
    if torch.is_tensor(v):
        inputs[k] = v.to("cuda:0")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        cfg_scale=1.3,
        tokenizer=processor.tokenizer,
        generation_config={"do_sample": False},
        is_prefill=False,
    )

audio = outputs.speech_outputs[0]  # 24 kHz audio tensor
```
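The waveform can be written to disk with any audio library; a minimal sketch using `soundfile` (assumes `audio` is a float waveform at 24 kHz, per the comment above):

```python
import soundfile as sf

# soundfile expects a CPU numpy array; the model emits 24 kHz audio.
sf.write("output.wav", audio.squeeze().float().cpu().numpy(), samplerate=24000)
```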
### Voice Cloning
```python
# With a voice reference
inputs = processor(
    text=["Speaker 1: Hello in a cloned voice!"],
    voice_samples=[["path/to/reference.wav"]],
    padding=True,
    return_tensors="pt",
    return_attention_mask=True,
)
# Move tensors to cuda:0 as in the basic example above.
for k, v in inputs.items():
    if torch.is_tensor(v):
        inputs[k] = v.to("cuda:0")

# Important: set is_prefill=True for voice cloning
outputs = model.generate(
    **inputs,
    cfg_scale=1.3,
    tokenizer=processor.tokenizer,
    is_prefill=True,  # required for voice cloning!
)
```
### With vibevoice-api
```python
from models.vibevoice import VibeVoiceModel

model = VibeVoiceModel(
    model_path="marksverdhai/vibevoice-7b-bnb-8bit",
    device="cuda:0",
    quantization="bnb-8bit",
)
model.load()

audio, sr = model.generate(
    text="Hello world!",
    speaker_name="alloy",
)
```
## Technical Details
### Why Hybrid Quantization?
During testing, we discovered that fully quantizing all modules with LLM.int8() produced pure noise instead of speech. The root cause:
- LLM.int8() uses dynamic quantization with outlier detection (threshold=6.0)
- Audio tokenizer weights have different distributions than typical LLM weights
- The outlier detection mechanism fails for these layers, causing numerical instability
- This propagates through the diffusion process, resulting in noise
NF4 (4-bit) doesn't have this issue because it always dequantizes to bfloat16 before computation.
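For comparison, this is what an NF4 load configuration looks like with the standard transformers API (a sketch; the INT4 row in the table above refers to a separate checkpoint):

```python
import torch
from transformers import BitsAndBytesConfig

# NF4 stores 4-bit weights but dequantizes to the compute dtype for
# every matmul, sidestepping LLM.int8()'s outlier-detection path.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
```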
### Quantization Config
The model is saved with this configuration:
```python
from transformers import BitsAndBytesConfig

BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=[
        "acoustic_tokenizer",
        "semantic_tokenizer",
        "prediction_head",
        "acoustic_connector",
        "semantic_connector",
        "lm_head",
    ],
)
```
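To reproduce this checkpoint yourself, the config above can be passed at load time and the result saved (a sketch; assumes the base repo id listed under Base Model below and a transformers/bitsandbytes pair recent enough to serialize 8-bit weights):

```python
import torch
from transformers import BitsAndBytesConfig
from vibevoice.modular.modeling_vibevoice_inference import (
    VibeVoiceForConditionalGenerationInference,
)

bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=[
        "acoustic_tokenizer", "semantic_tokenizer", "prediction_head",
        "acoustic_connector", "semantic_connector", "lm_head",
    ],
)
model = VibeVoiceForConditionalGenerationInference.from_pretrained(
    "vibevoice/VibeVoice-7B",  # base model
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
)
model.save_pretrained("./vibevoice-7b-bnb-8bit")  # embeds the quantization config
```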
## Limitations
- Requires bitsandbytes library
- CUDA-only (no CPU inference)
- Slightly slower than INT4 due to larger weights
- Uses more VRAM than INT4 variant
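A quick pre-flight check for the first two points (a minimal sketch):

```python
import torch

# LLM.int8() kernels are CUDA-only; this checkpoint cannot run on CPU.
assert torch.cuda.is_available(), "A CUDA GPU is required for INT8 inference"

import bitsandbytes  # noqa: F401  (raises ImportError if not installed)
```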
## License
Apache 2.0 (same as base model)
## Citation
```bibtex
@misc{vibevoice7b-int8,
  title={VibeVoice-7B INT8 Quantized},
  author={marksverdhei},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/marksverdhai/vibevoice-7b-bnb-8bit}
}
```
## Base Model

vibevoice/VibeVoice-7B