granite-docling-258M-GGUF
This repository contains GGUF format quantized versions of ibm-granite/granite-docling-258M for use with llama.cpp.
Granite Docling is a multimodal Image-Text-to-Text model engineered for efficient document conversion. This GGUF version enables fast CPU and GPU inference using llama.cpp, making it ideal for edge deployment and resource-constrained environments.
Model Summary
- Original Model: ibm-granite/granite-docling-258M
- Developed by: IBM Research
- Model type: Multi-modal model (image+text-to-text)
- Architecture: Idefics3 (SigLIP vision encoder + Granite 165M LLM)
- Language(s): English (NLP)
- License: Apache 2.0
- Format: GGUF (for llama.cpp)
- Quantizations: F16, Q8_0, Q6_K, Q5_K_M, Q4_K_M
Files Included
This repository contains multiple quantization levels of the text model, plus the vision encoder/projector. You need one text model file + the mmproj file for inference.
Text Model Files (choose one):
| Filename | Quant | Size | Use Case |
|---|---|---|---|
| granite-docling-258M-Q4_K_M.gguf | Q4_K_M | 133 MB | Smallest, good quality-size balance, recommended for most users |
| granite-docling-258M-Q5_K_M.gguf | Q5_K_M | 139 MB | Better quality, still compact |
| granite-docling-258M-Q6_K.gguf | Q6_K | 164 MB | Higher quality |
| granite-docling-258M-Q8_0.gguf | Q8_0 | 170 MB | Very high quality |
| granite-docling-258M-f16.gguf | F16 | 317 MB | Highest quality, original precision |
Vision/Projector File (required):
| Filename | Size | Notes |
|---|---|---|
| mmproj-granite-docling-258M-f16.gguf | 182 MB | Vision encoder (SigLIP) - required for all quantizations |
Note: The mmproj file is kept at F16 precision to maintain vision quality.
Getting Started with llama.cpp
Prerequisites
- Build llama.cpp with multimodal support:
  ```bash
  git clone https://github.com/ggerganov/llama.cpp
  cd llama.cpp
  cmake -B build -DGGML_CUDA=ON   # enables CUDA; omit this flag for a CPU-only build
  cmake --build build --config Release -j
  ```
- Download the GGUF files from this repository
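For example, the two required files can be fetched with the Hugging Face CLI. A minimal sketch, assuming this repository's id (infil00p/granite-docling-258M-GGUF) and the Q4_K_M text model; swap the filename for another quantization if you prefer:

```bash
# Requires the Hugging Face CLI: pip install -U "huggingface_hub[cli]"
# Grab one text model file plus the required mmproj file into the current directory
huggingface-cli download infil00p/granite-docling-258M-GGUF \
  granite-docling-258M-Q4_K_M.gguf \
  mmproj-granite-docling-258M-f16.gguf \
  --local-dir .
```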
Basic Usage
```bash
# Using Q4_K_M (recommended for most users)
./llama.cpp/build/bin/llama-mtmd-cli \
-m granite-docling-258M-Q4_K_M.gguf \
--mmproj mmproj-granite-docling-258M-f16.gguf \
--image document.png \
--chat-template "{%- for message in messages -%}{{- '<|start_of_role|>' + message['role'] + '<|end_of_role|>' -}}{%- if message['content'] is string -%}{{- message['content'] -}}{%- else -%}{%- for part in message['content'] -%}{%- if part['type'] == 'text' -%}{{- part['text'] -}}{%- elif part['type'] == 'image' -%}{{- '<image>' -}}{%- endif -%}{%- endfor -%}{%- endif -%}{{- '<|end_of_text|>\n' -}}{%- endfor -%}{%- if add_generation_prompt -%}{{- '<|start_of_role|>assistant' -}}{%- if controls -%}{{- ' ' + controls | tojson() -}}{%- endif -%}{{- '<|end_of_role|>' -}}{%- endif -%}" \
-p "Convert this page to docling." \
-n 512 \
--temp 0.1 \
-ngl 99
# Or use any other quantization (Q5_K_M, Q6_K, Q8_0, F16)
# Just replace the -m parameter with your chosen model file
```
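If you prefer an HTTP endpoint over the CLI, the same file pair can be hosted with llama-server, which exposes an OpenAI-compatible /v1/chat/completions API in recent llama.cpp builds. The sketch below is a starting point rather than a verified recipe: it reuses the template file created in the next section (via --chat-template-file and --jinja, flags that depend on your build) and encodes the page inline with a Linux-style base64 -w0:

```bash
# Start the server with the vision projector loaded
./llama.cpp/build/bin/llama-server \
  -m granite-docling-258M-Q4_K_M.gguf \
  --mmproj mmproj-granite-docling-258M-f16.gguf \
  --chat-template-file granite_docling_template.jinja --jinja \
  -ngl 99 --port 8080

# From another shell: send a page image as a base64 data URL
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
        \"messages\": [{
          \"role\": \"user\",
          \"content\": [
            {\"type\": \"image_url\",
             \"image_url\": {\"url\": \"data:image/png;base64,$(base64 -w0 document.png)\"}},
            {\"type\": \"text\", \"text\": \"Convert this page to docling.\"}
          ]
        }],
        \"temperature\": 0.1,
        \"max_tokens\": 512
      }"
```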
Simplified Usage (Save Chat Template)
Save the chat template to a file for easier reuse:
```bash
# Save chat template
cat > granite_docling_template.jinja << 'EOF'
{%- for message in messages -%}
{{- '<|start_of_role|>' + message['role'] + '<|end_of_role|>' -}}
{%- if message['content'] is string -%}
{{- message['content'] -}}
{%- else -%}
{%- for part in message['content'] -%}
{%- if part['type'] == 'text' -%}
{{- part['text'] -}}
{%- elif part['type'] == 'image' -%}
{{- '<image>' -}}
{%- endif -%}
{%- endfor -%}
{%- endif -%}
{{- '<|end_of_text|>
' -}}
{%- endfor -%}
{%- if add_generation_prompt -%}
{{- '<|start_of_role|>assistant' -}}
{%- if controls -%}{{- ' ' + controls | tojson() -}}{%- endif -%}
{{- '<|end_of_role|>' -}}
{%- endif -%}
EOF
# Then use it:
./llama.cpp/build/bin/llama-mtmd-cli \
-m granite-docling-258M-Q4_K_M.gguf \
--mmproj mmproj-granite-docling-258M-f16.gguf \
--image document.png \
--chat-template "$(cat granite_docling_template.jinja)" \
-p "Convert this page to docling." \
-n 512 \
-ngl 99
```
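The saved template also makes batch conversion straightforward. A rough sketch, assuming the document has already been rasterized to page-*.png (for example with pdftoppm) and that your build writes the generated text to stdout and its logs to stderr:

```bash
# Convert every page image and keep one DocTags file per page
for img in page-*.png; do
  ./llama.cpp/build/bin/llama-mtmd-cli \
    -m granite-docling-258M-Q4_K_M.gguf \
    --mmproj mmproj-granite-docling-258M-f16.gguf \
    --image "$img" \
    --chat-template "$(cat granite_docling_template.jinja)" \
    -p "Convert this page to docling." \
    -n 512 --temp 0.1 -ngl 99 \
    > "${img%.png}.doctags"   # redirect the generated DocTags to a file per page
done
```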
Choosing a Quantization
Recommended for most users: Q4_K_M - Best balance of size and quality
| Quantization | Total Size | Quality | Speed | RAM Usage |
|---|---|---|---|---|
| Q4_K_M | 315 MB | Good | Fastest | ~400 MB |
| Q5_K_M | 321 MB | Better | Fast | ~420 MB |
| Q6_K | 346 MB | High | Medium | ~450 MB |
| Q8_0 | 352 MB | Very High | Medium | ~480 MB |
| F16 | 499 MB | Highest | Slower | ~650 MB |
Total size = text model + mmproj (182 MB)
Example Output
The model outputs DocTags format with precise layout information:
```
<doctag>
<page_header><loc_145><loc_28><loc_355><loc_35>ENERGY BUDGET OF WASP-121 b</page_header>
<text><loc_88><loc_42><loc_242><loc_89>while the kernel weights are structured as...</text>
...
</doctag>
```
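For a quick look at the recognized text, the tags can be stripped from a saved DocTags file with standard shell tools. A rough sketch only; page-1.doctags is a placeholder filename, and a real pipeline would normally hand the DocTags to the Docling library for proper reconstruction:

```bash
# Print the plain text of all <text> elements, dropping the location tags
grep -o '<text>.*</text>' page-1.doctags | sed -E 's/<[^>]+>//g'
```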
Supported Instructions
| Description | Instruction |
|---|---|
| Full conversion | Convert this page to docling. |
| Chart | Convert chart to table. |
| Formula | Convert formula to LaTeX. |
| Code | Convert code to text. |
| Table | Convert table to OTSL. |
| OCR region | OCR the text in a specific location: <loc_155><loc_233><loc_206><loc_237> |
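Only the prompt changes between tasks; the rest of the invocation stays the same. For example, to extract a table as OTSL (table_crop.png is a placeholder image, and the saved template file from above is assumed):

```bash
./llama.cpp/build/bin/llama-mtmd-cli \
  -m granite-docling-258M-Q4_K_M.gguf \
  --mmproj mmproj-granite-docling-258M-f16.gguf \
  --image table_crop.png \
  --chat-template "$(cat granite_docling_template.jinja)" \
  -p "Convert table to OTSL." \
  -n 512 --temp 0.1 -ngl 99
```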
Performance
Tested on NVIDIA RTX 4070 Ti SUPER with CUDA:
- Image encoding: ~6-8ms per slice (17 slices total for 512x512 images)
- Prompt processing: ~1305 tokens/sec
- Generation speed: ~706 tokens/sec
- Total memory: ~600 MB GPU (with all layers offloaded)
Model Architecture
The architecture consists of:
Vision encoder: siglip2-base-patch16-512
- 768 hidden dimensions, 12 layers, 12 heads
- 512x512 image input with 16x16 patches
- Image splitting: 4x4 grid + global view = 17 frames
Vision-language connector: Pixel shuffle projector (Idefics3 style)
- Projects vision features to LLM embedding space
- Scale factor: 4
Large language model: Granite 165M
- 30 layers, 9 attention heads, 3 KV heads
- 576 hidden dimensions
- Context length: 8192 tokens
- Vocabulary: 100,352 tokens (GPT2 tokenizer with DocTags extensions)
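These figures can be cross-checked against the GGUF headers themselves. A small sketch using the gguf-dump utility from the gguf Python package (assumed to be installed via pip install gguf), which prints key/value metadata such as block count, head counts, and context length:

```bash
pip install gguf
gguf-dump granite-docling-258M-Q4_K_M.gguf | head -n 60       # text model metadata
gguf-dump mmproj-granite-docling-258M-f16.gguf | head -n 60    # vision projector metadata
```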
Conversion Details
These GGUF files were converted from the original model using llama.cpp's conversion tools:
Vision/Projector conversion (F16):
```bash
python convert_hf_to_gguf.py granite-docling-258M \
  --mmproj --outtype f16 \
  --outfile mmproj-granite-docling-258M-f16.gguf
```
Text model conversion:
- Extracted text model weights from the full VLM model
- Converted to GGUF with F16 precision
- Preserved all special tokens and tokenizer configuration
Quantization:
```bash
llama-quantize granite-docling-258M-f16.gguf granite-docling-258M-Q4_K_M.gguf Q4_K_M
llama-quantize granite-docling-258M-f16.gguf granite-docling-258M-Q5_K_M.gguf Q5_K_M
llama-quantize granite-docling-258M-f16.gguf granite-docling-258M-Q6_K.gguf Q6_K
llama-quantize granite-docling-258M-f16.gguf granite-docling-258M-Q8_0.gguf Q8_0
```
Use Cases
Granite-Docling excels at:
- Document OCR: Extract text from scanned documents with layout preservation
- Table Recognition: Convert tables to structured formats (OTSL/HTML)
- Equation Recognition: Extract LaTeX from mathematical formulas
- Code Recognition: Extract code snippets from documents
- Chart-to-Table: Convert charts and graphs to structured data
- Layout Analysis: Understand document structure (headers, footers, sections)
Limitations
- Not for general image understanding: For general vision tasks, use Granite Vision models
- Document-focused: Optimized for document pages, not natural images
- English primary: Best performance on English documents (experimental support for Japanese, Arabic, Chinese)
- Potential hallucination: Like all smaller VLMs, may hallucinate in complex scenarios
Responsible Use
This model is designed for document understanding and should be used responsibly:
- Verify outputs for critical applications
- Be aware of potential biases in document interpretation
- Do not use for autonomous decision-making without human oversight
- Consider using with Granite Guardian for additional safety
Citation
If you use this model, please cite:
```bibtex
@misc{granite-docling-2025,
  title={Granite Docling: Efficient Document Conversion with Vision-Language Models},
  author={IBM Research},
  year={2025},
  url={https://huggingface.co/ibm-granite/granite-docling-258M}
}
```
Resources
- [Original Model](https://huggingface.co/ibm-granite/granite-docling-258M)
- Docling Library
- [llama.cpp](https://github.com/ggerganov/llama.cpp)
- Docling Documentation
- Granite Resources
Acknowledgments
- Original model by IBM Research
- GGUF conversion using llama.cpp conversion tools
- Thanks to the llama.cpp team for multimodal support