VibeVoice-7B INT8 Quantized (bitsandbytes)

Pre-quantized 8-bit version of VibeVoice-7B using bitsandbytes LLM.int8() quantization.

Memory Usage

Precision            VRAM       Model Size
Full (bf16)          ~17 GB     ~14 GB
INT8 (this model)    ~11.4 GB   ~11.3 GB
INT4 (bnb-4bit)      ~6.2 GB    ~4.2 GB

Saves ~5.6 GB VRAM compared to full precision.

Important: Audio Components in BF16

This model uses a hybrid quantization approach:

  • Language Model (Qwen2-7B backbone): INT8 quantized
  • Audio Components: Remain in bfloat16 (NOT quantized)

The following modules are kept in bfloat16 because LLM.int8() causes numerical instability in audio generation:

acoustic_tokenizer   - Audio encoder/decoder VAE
semantic_tokenizer   - Semantic audio encoder
prediction_head      - Diffusion head for audio generation
acoustic_connector   - Bridge between LM and acoustic features
semantic_connector   - Bridge between LM and semantic features
lm_head              - Text output head

This is why the model uses more VRAM than a fully-quantized INT8 model would.
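
You can verify the split by checking which linear layers bitsandbytes actually replaced. A minimal sketch, assuming model has been loaded as shown in the Usage section below and bitsandbytes is installed:

import bitsandbytes as bnb
import torch.nn as nn

# Linear8bitLt subclasses nn.Linear, so test for it first
int8_count, bf16_count = 0, 0
for name, module in model.named_modules():
    if isinstance(module, bnb.nn.Linear8bitLt):
        int8_count += 1
    elif isinstance(module, nn.Linear):
        bf16_count += 1

print(f"INT8 linear layers: {int8_count}, bfloat16 linear layers: {bf16_count}")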

Usage

from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor
import torch

# The model already has quantization config embedded
model = VibeVoiceForConditionalGenerationInference.from_pretrained(
    "marksverdhai/vibevoice-7b-bnb-8bit",
    device_map={"": 0},
    torch_dtype=torch.bfloat16,
)
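
# Optional: confirm the weight footprint (roughly 11 GB expected for this checkpoint)
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.1f} GB")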

processor = VibeVoiceProcessor.from_pretrained("marksverdhai/vibevoice-7b-bnb-8bit")

# Generate speech
model.eval()
model.set_ddpm_inference_steps(num_steps=10)

inputs = processor(
    text=["Speaker 1: Hello, this is a test of the quantized model."],
    padding=True,
    return_tensors="pt",
    return_attention_mask=True,
)

for k, v in inputs.items():
    if torch.is_tensor(v):
        inputs[k] = v.to("cuda:0")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        cfg_scale=1.3,
        tokenizer=processor.tokenizer,
        generation_config={"do_sample": False},
        is_prefill=False,
    )

audio = outputs.speech_outputs[0]  # 24kHz audio tensor
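
To write the result to disk, something like the following should work. This is a sketch: it assumes audio is a float tensor on the GPU and uses the soundfile package, which is not listed as a requirement of this model.

import soundfile as sf

# NumPy cannot hold bfloat16, so cast to float32 before converting
sf.write("output.wav", audio.squeeze().to(torch.float32).cpu().numpy(), samplerate=24000)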

Voice Cloning

# With voice reference
inputs = processor(
    text=["Speaker 1: Hello in a cloned voice!"],
    voice_samples=[["path/to/reference.wav"]],
    padding=True,
    return_tensors="pt",
    return_attention_mask=True,
)
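
# As in the basic example, move the input tensors to the GPU first
for k, v in inputs.items():
    if torch.is_tensor(v):
        inputs[k] = v.to("cuda:0")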

# Important: Set is_prefill=True for voice cloning
outputs = model.generate(
    **inputs,
    cfg_scale=1.3,
    tokenizer=processor.tokenizer,
    is_prefill=True,  # Required for voice cloning!
)

With vibevoice-api

from models.vibevoice import VibeVoiceModel

model = VibeVoiceModel(
    model_path="marksverdhai/vibevoice-7b-bnb-8bit",
    device="cuda:0",
    quantization="bnb-8bit",
)
model.load()

audio, sr = model.generate(
    text="Hello world!",
    speaker_name="alloy",
)
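
The tuple unpacking above suggests generate returns the waveform together with its sample rate. If the waveform comes back as a NumPy array (an assumption about vibevoice-api's return types), saving it is one line:

import soundfile as sf

sf.write("hello.wav", audio, samplerate=sr)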

Technical Details

Why Hybrid Quantization?

During testing, we discovered that fully quantizing all modules with LLM.int8() produced pure noise instead of speech. The root cause:

  1. LLM.int8() uses dynamic quantization with outlier detection (threshold=6.0)
  2. Audio tokenizer weights have different distributions than typical LLM weights
  3. The outlier detection mechanism fails for these layers, causing numerical instability
  4. This propagates through the diffusion process, resulting in noise

NF4 (4-bit) doesn't have this issue because it always dequantizes to bfloat16 before computation.
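
For comparison, a minimal NF4 configuration looks like this (a sketch using the standard bitsandbytes options exposed by transformers):

import torch
from transformers import BitsAndBytesConfig

# NF4 stores weights in 4-bit but dequantizes to bfloat16 for every matmul,
# so the audio modules see ordinary bfloat16 arithmetic
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)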

Quantization Config

The model is saved with this configuration:

from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=[
        "acoustic_tokenizer",
        "semantic_tokenizer",
        "prediction_head",
        "acoustic_connector",
        "semantic_connector",
        "lm_head",
    ],
)
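
For reference, a checkpoint like this one can be reproduced by loading the full-precision weights with the config above and re-saving them. A minimal sketch; the base checkpoint id and output directory are placeholders, and serializing 8-bit weights requires a recent transformers/bitsandbytes:

model = VibeVoiceForConditionalGenerationInference.from_pretrained(
    "path/to/full-precision-vibevoice-7b",    # placeholder for the bf16 base checkpoint
    quantization_config=quantization_config,  # the config shown above
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
)
model.save_pretrained("vibevoice-7b-bnb-8bit")  # writes INT8 safetensors plus the embedded config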

Limitations

  • Requires bitsandbytes library
  • CUDA-only (no CPU inference)
  • Slightly slower than INT4 due to larger weights
  • Uses more VRAM than INT4 variant

License

Apache 2.0 (same as base model)

Citation

@misc{vibevoice7b-int8,
  title={VibeVoice-7B INT8 Quantized},
  author={marksverdhei},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/marksverdhai/vibevoice-7b-bnb-8bit}
}