LeanLlama-8B-INT4 is a 4-bit quantized variant of LeanLlama-8B that combines NF4 weight quantization with learned KV cache compression. It reduces both model weight memory and inference-time KV cache memory, making it suitable for deployment on consumer GPUs.
The base Llama 3.1 8B Instruct weights are quantized to NF4. The learned KV cache compression modules themselves remain in fp16 to preserve compression fidelity.
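For intuition, blockwise NF4 can be sketched in a few lines of NumPy: each block of weights is scaled by its absolute maximum, then each value is snapped to the nearest of 16 fixed quantile levels. This is an illustrative re-implementation (level values approximated from the QLoRA/bitsandbytes release), not the kernel this model actually ships with:

```python
import numpy as np

# The 16 NF4 code levels (approximate values as published with QLoRA /
# bitsandbytes). Illustrative only; not this model's actual kernel.
NF4_LEVELS = np.array([
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
])

def nf4_quantize(w, block_size=64):
    """Blockwise NF4: one fp scale per block, 4-bit code per weight."""
    w = w.reshape(-1, block_size)
    scales = np.abs(w).max(axis=1, keepdims=True)   # absmax scale per block
    normalized = w / scales                          # now in [-1, 1]
    # Snap each normalized weight to the nearest NF4 level (index 0..15).
    codes = np.abs(normalized[..., None] - NF4_LEVELS).argmin(axis=-1)
    return codes.astype(np.uint8), scales

def nf4_dequantize(codes, scales):
    return NF4_LEVELS[codes] * scales

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 64)).astype(np.float32)
codes, scales = nf4_quantize(w)
w_hat = nf4_dequantize(codes, scales).reshape(w.shape)
```

Each weight is stored as a 4-bit code plus a shared per-block scale, which is where the roughly 4x weight-memory reduction comes from.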
Expected quality relative to the uncompressed Llama 3.1 8B Instruct baseline:
| Metric | Delta |
|---|---|
| Perplexity | ~+7% |
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code is required: the NF4 dequantization and KV cache
# compression live in the repo's custom modeling code.
model = AutoModelForCausalLM.from_pretrained(
    "miike-ai/LeanLlama-8B-INT4",
    trust_remote_code=True,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("miike-ai/LeanLlama-8B-INT4")

inputs = tokenizer("What is the capital of France?", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
No special configuration is needed. The NF4 dequantization and KV cache compression both run transparently inside the model's forward pass.
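To see why KV cache compression matters on consumer GPUs, here is a back-of-envelope sizing for Llama 3.1 8B's attention layout (32 layers, 8 grouped-query KV heads, head dim 128). The helper below is illustrative arithmetic, not part of the model's API:

```python
# Uncompressed fp16 KV cache size for Llama 3.1 8B's architecture:
# 32 layers, 8 KV heads (GQA), head_dim 128, 2 bytes per fp16 element.
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2):
    # Factor of 2 accounts for storing both keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# At the model family's 128k-token context, the uncompressed cache alone
# is ~15.6 GiB, more than most consumer GPUs' total VRAM.
full = kv_cache_bytes(128_000)
print(f"fp16 KV cache at 128k tokens: {full / 2**30:.3f} GiB")
```

Even at a few thousand tokens the cache runs to hundreds of MiB per sequence, so compressing it frees memory for longer contexts or larger batches.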
Base model: meta-llama/Llama-3.1-8B-Instruct (via miike-ai/LeanLlama-8B)