LeanLlama-8B-INT4 is a 4-bit quantized variant of LeanLlama-8B that combines NF4 weight quantization with learned KV cache compression. It reduces both model weight memory and inference-time KV cache memory, making it suitable for deployment on consumer GPUs.
The base Llama 3.1 8B Instruct weights are quantized to NF4. The learned KV cache compression modules themselves remain in fp16 to preserve compression fidelity.
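For intuition, blockwise NF4 can be sketched in a few lines of NumPy: each block of weights is scaled by its absolute maximum, then each value is snapped to the nearest of 16 fixed quantile levels. This is an illustrative re-implementation (level values approximated from the QLoRA/bitsandbytes release), not the kernel this model actually ships with:

```python
import numpy as np

# The 16 NF4 code levels (approximate values as published with QLoRA /
# bitsandbytes). Illustrative only; not this model's actual kernel.
NF4_LEVELS = np.array([
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
])

def nf4_quantize(w, block_size=64):
    """Blockwise NF4: one fp scale per block, 4-bit code per weight."""
    w = w.reshape(-1, block_size)
    scales = np.abs(w).max(axis=1, keepdims=True)   # absmax scale per block
    normalized = w / scales                          # now in [-1, 1]
    # Snap each normalized weight to the nearest NF4 level (index 0..15).
    codes = np.abs(normalized[..., None] - NF4_LEVELS).argmin(axis=-1)
    return codes.astype(np.uint8), scales

def nf4_dequantize(codes, scales):
    return NF4_LEVELS[codes] * scales

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 64)).astype(np.float32)
codes, scales = nf4_quantize(w)
w_hat = nf4_dequantize(codes, scales).reshape(w.shape)
```

Each weight is stored as a 4-bit code plus a shared per-block scale, which is where the roughly 4x weight-memory reduction comes from.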
Expected quality relative to the uncompressed Llama 3.1 8B Instruct baseline:
| Metric | Delta |
|---|---|
| Perplexity | ~+7% |
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code is required: the NF4 dequantization and KV cache
# compression live in the repo's custom modeling code.
model = AutoModelForCausalLM.from_pretrained(
    "miike-ai/LeanLlama-8B-INT4",
    trust_remote_code=True,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("miike-ai/LeanLlama-8B-INT4")

inputs = tokenizer("What is the capital of France?", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
No special configuration is needed. The NF4 dequantization and KV cache compression both run transparently inside the model's forward pass.
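To see why KV cache compression matters on consumer GPUs, here is a back-of-envelope sizing for Llama 3.1 8B's attention layout (32 layers, 8 grouped-query KV heads, head dim 128). The helper below is illustrative arithmetic, not part of the model's API:

```python
# Uncompressed fp16 KV cache size for Llama 3.1 8B's architecture:
# 32 layers, 8 KV heads (GQA), head_dim 128, 2 bytes per fp16 element.
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2):
    # Factor of 2 accounts for storing both keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# At the model family's 128k-token context, the uncompressed cache alone
# is ~15.6 GiB, more than most consumer GPUs' total VRAM.
full = kv_cache_bytes(128_000)
print(f"fp16 KV cache at 128k tokens: {full / 2**30:.3f} GiB")
```

Even at a few thousand tokens the cache runs to hundreds of MiB per sequence, so compressing it frees memory for longer contexts or larger batches.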
Base model: meta-llama/Llama-3.1-8B-Instruct (via miike-ai/LeanLlama-8B)