
LeanLlama-8B-INT4

LeanLlama-8B-INT4 is a 4-bit quantized variant of LeanLlama-8B that combines NF4 weight quantization with learned KV cache compression. It reduces both model weight memory and inference-time KV cache memory, making it suitable for deployment on consumer GPUs.

What changed

  • Weight quantization (Phase 1): All transformer weights (including embeddings) are quantized to NF4 with double quantization, reducing the model from ~16 GB to ~5.7 GB on disk.
  • KV cache compression (Phase 2): Inherited from LeanLlama-8B. Learned projection modules compress the value representations stored in the KV cache at a subset of layers, reducing the memory footprint of long-context inference.

The base Llama 3.1 8B Instruct weights are carried over and quantized to NF4. The KV cache compression modules themselves remain in fp16 to preserve reconstruction fidelity.
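As a rough sanity check on the numbers above, the sketch below estimates weight storage from the bit width alone. The 8B parameter count is a round assumption; the actual ~5.7 GB on-disk size is larger than the pure 4-bit figure because it also includes the fp16 compression modules, NF4 quantization constants, and checkpoint metadata.

```python
# Back-of-envelope weight memory estimate (1 GB = 1e9 bytes).
# Assumes a round 8e9 parameter count for illustration.

def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in gigabytes."""
    return n_params * bits_per_param / 8 / 1e9

fp16_gb = weight_memory_gb(8e9, 16)  # 16-bit baseline
nf4_gb = weight_memory_gb(8e9, 4)    # pure 4-bit weights, before overhead

print(f"fp16: {fp16_gb:.1f} GB, NF4: {nf4_gb:.1f} GB")
```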

Quality

Expected quality relative to the uncompressed Llama 3.1 8B Instruct baseline:

Metric        Delta vs. fp16 baseline
Perplexity    ~+7%
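To make the delta concrete: a ~+7% perplexity increase is multiplicative, so a hypothetical baseline perplexity of 6.0 (chosen purely for illustration) would rise to about 6.42. Since perplexity is exp(cross-entropy loss), this corresponds to a per-token loss increase of ln(1.07) ≈ 0.068 nats.

```python
import math

def degraded_ppl(baseline_ppl: float, delta: float = 0.07) -> float:
    """Perplexity after a relative increase of `delta` (0.07 == +7%)."""
    return baseline_ppl * (1 + delta)

baseline = 6.0  # hypothetical baseline perplexity, for illustration only
print(f"quantized ppl ≈ {degraded_ppl(baseline):.2f}")
print(f"loss delta ≈ {math.log(1.07):.3f} nats/token")
```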

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code is required so the custom KV cache compression
# modules shipped in the repo are loaded alongside the weights.
model = AutoModelForCausalLM.from_pretrained(
    "miike-ai/LeanLlama-8B-INT4",
    trust_remote_code=True,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("miike-ai/LeanLlama-8B-INT4")

inputs = tokenizer("What is the capital of France?", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))

No special configuration is needed. The NF4 dequantization and KV cache compression both run transparently inside the model's forward pass.
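For context on why KV cache compression matters at a 128K context window, the sketch below computes the uncompressed fp16 cache size using the standard Llama 3.1 8B dimensions (32 layers, 8 KV heads via grouped-query attention, head dim 128). These dimensions and the resulting figure are assumptions about the base architecture, not published measurements for this repo; the card does not state the compression ratio, so only the uncompressed baseline is shown.

```python
def kv_cache_bytes(seq_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """Uncompressed KV cache size: keys + values at every layer, fp16."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

per_token = kv_cache_bytes(1)          # 131072 bytes = 128 KiB per token
full = kv_cache_bytes(128_000)
print(f"per token: {per_token / 1024:.0f} KiB")
print(f"128K-token cache: {full / 2**30:.1f} GiB")
```

At these assumed dimensions the uncompressed cache alone approaches 16 GiB at full context, which is why compressing value representations at even a subset of layers meaningfully lowers the consumer-GPU memory bar.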

Base model

  • Architecture: Llama 3.1
  • Parameters: 8B
  • Source: meta-llama/Llama-3.1-8B-Instruct via miike-ai/LeanLlama-8B
  • Context window: 128K tokens
  • Quantization: NF4 (bitsandbytes) with double quantization
  • License: Llama 3.1 Community License