VibeVoice-7B INT8 Quantized (bitsandbytes)

Pre-quantized 8-bit version of VibeVoice-7B using bitsandbytes LLM.int8() quantization.

Memory Usage

Precision            VRAM       Model Size
Full (bf16)          ~17 GB     ~14 GB
INT8 (this model)    ~11.4 GB   ~11.3 GB
INT4 (bnb-4bit)      ~6.2 GB    ~4.2 GB

Saves ~5.6 GB VRAM compared to full precision.

Important: Audio Components in BF16

This model uses a hybrid quantization approach:

  • Language Model (Qwen2-7B backbone): INT8 quantized
  • Audio Components: Remain in bfloat16 (NOT quantized)

The following modules are kept in bfloat16 because LLM.int8() causes numerical instability in audio generation:

acoustic_tokenizer   - Audio encoder/decoder VAE
semantic_tokenizer   - Semantic audio encoder
prediction_head      - Diffusion head for audio generation
acoustic_connector   - Bridge between LM and acoustic features
semantic_connector   - Bridge between LM and semantic features
lm_head              - Text output head

This is why the model uses more VRAM than a fully-quantized INT8 model would.
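
You can verify the split by checking which linear layers bitsandbytes actually replaced. A minimal sketch, assuming model has been loaded as shown in the Usage section below and bitsandbytes is installed:

import bitsandbytes as bnb
import torch.nn as nn

# Linear8bitLt subclasses nn.Linear, so test for it first
int8_count, bf16_count = 0, 0
for name, module in model.named_modules():
    if isinstance(module, bnb.nn.Linear8bitLt):
        int8_count += 1
    elif isinstance(module, nn.Linear):
        bf16_count += 1

print(f"INT8 linear layers: {int8_count}, bfloat16 linear layers: {bf16_count}")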

Usage

from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor
import torch

# The model already has quantization config embedded
model = VibeVoiceForConditionalGenerationInference.from_pretrained(
    "marksverdhai/vibevoice-7b-bnb-8bit",
    device_map={"": 0},
    torch_dtype=torch.bfloat16,
)
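
# Optional: confirm the weight footprint (roughly 11 GB expected for this checkpoint)
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.1f} GB")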

processor = VibeVoiceProcessor.from_pretrained("marksverdhai/vibevoice-7b-bnb-8bit")

# Generate speech
model.eval()
model.set_ddpm_inference_steps(num_steps=10)

inputs = processor(
    text=["Speaker 1: Hello, this is a test of the quantized model."],
    padding=True,
    return_tensors="pt",
    return_attention_mask=True,
)

for k, v in inputs.items():
    if torch.is_tensor(v):
        inputs[k] = v.to("cuda:0")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        cfg_scale=1.3,
        tokenizer=processor.tokenizer,
        generation_config={"do_sample": False},
        is_prefill=False,
    )

audio = outputs.speech_outputs[0]  # 24kHz audio tensor
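
To write the result to disk, something like the following should work. This is a sketch: it assumes audio is a float tensor on the GPU and uses the soundfile package, which is not listed as a requirement of this model.

import soundfile as sf

# NumPy cannot hold bfloat16, so cast to float32 before converting
sf.write("output.wav", audio.squeeze().to(torch.float32).cpu().numpy(), samplerate=24000)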

Voice Cloning

# With voice reference
inputs = processor(
    text=["Speaker 1: Hello in a cloned voice!"],
    voice_samples=[["path/to/reference.wav"]],
    padding=True,
    return_tensors="pt",
    return_attention_mask=True,
)
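
# As in the basic example, move the input tensors to the GPU first
for k, v in inputs.items():
    if torch.is_tensor(v):
        inputs[k] = v.to("cuda:0")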

# Important: Set is_prefill=True for voice cloning
outputs = model.generate(
    **inputs,
    cfg_scale=1.3,
    tokenizer=processor.tokenizer,
    is_prefill=True,  # Required for voice cloning!
)

With vibevoice-api

from models.vibevoice import VibeVoiceModel

model = VibeVoiceModel(
    model_path="marksverdhai/vibevoice-7b-bnb-8bit",
    device="cuda:0",
    quantization="bnb-8bit",
)
model.load()

audio, sr = model.generate(
    text="Hello world!",
    speaker_name="alloy",
)
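
The tuple unpacking above suggests generate returns the waveform together with its sample rate. If the waveform comes back as a NumPy array (an assumption about vibevoice-api's return types), saving it is one line:

import soundfile as sf

sf.write("hello.wav", audio, samplerate=sr)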

Technical Details

Why Hybrid Quantization?

During testing, we discovered that fully quantizing all modules with LLM.int8() produced pure noise instead of speech. The root cause:

  1. LLM.int8() uses dynamic quantization with outlier detection (threshold=6.0)
  2. Audio tokenizer weights have different distributions than typical LLM weights
  3. The outlier detection mechanism fails for these layers, causing numerical instability
  4. This propagates through the diffusion process, resulting in noise

NF4 (4-bit) doesn't have this issue because it always dequantizes to bfloat16 before computation.
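
For comparison, a minimal NF4 configuration looks like this (a sketch using the standard bitsandbytes options exposed by transformers):

import torch
from transformers import BitsAndBytesConfig

# NF4 stores weights in 4-bit but dequantizes to bfloat16 for every matmul,
# so the audio modules see ordinary bfloat16 arithmetic
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)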

Quantization Config

The model is saved with this configuration:

from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=[
        "acoustic_tokenizer",
        "semantic_tokenizer",
        "prediction_head",
        "acoustic_connector",
        "semantic_connector",
        "lm_head",
    ],
)
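
For reference, a checkpoint like this one can be reproduced by loading the full-precision weights with the config above and re-saving them. A minimal sketch; the base checkpoint id and output directory are placeholders, and serializing 8-bit weights requires a recent transformers/bitsandbytes:

model = VibeVoiceForConditionalGenerationInference.from_pretrained(
    "path/to/full-precision-vibevoice-7b",    # placeholder for the bf16 base checkpoint
    quantization_config=quantization_config,  # the config shown above
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
)
model.save_pretrained("vibevoice-7b-bnb-8bit")  # writes INT8 safetensors plus the embedded config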

Limitations

  • Requires bitsandbytes library
  • CUDA-only (no CPU inference)
  • Slightly slower than INT4 due to larger weights
  • Uses more VRAM than INT4 variant

License

Apache 2.0 (same as base model)

Citation

@misc{vibevoice7b-int8,
  title={VibeVoice-7B INT8 Quantized},
  author={marksverdhei},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/marksverdhai/vibevoice-7b-bnb-8bit}
}