granite-docling-258M-GGUF

This repository contains GGUF format quantized versions of ibm-granite/granite-docling-258M for use with llama.cpp.

Granite Docling is a multimodal Image-Text-to-Text model engineered for efficient document conversion. This GGUF version enables fast CPU and GPU inference using llama.cpp, making it ideal for edge deployment and resource-constrained environments.

Model Summary

  • Original Model: ibm-granite/granite-docling-258M
  • Developed by: IBM Research
  • Model type: Multi-modal model (image+text-to-text)
  • Architecture: Idefics3 (SigLIP vision encoder + Granite 165M LLM)
  • Language(s): English
  • License: Apache 2.0
  • Format: GGUF (for llama.cpp)
  • Quantizations: F16, Q8_0, Q6_K, Q5_K_M, Q4_K_M

Files Included

This repository contains multiple quantization levels of the text model, plus the vision encoder/projector. You need one text model file + the mmproj file for inference.

Text Model Files (choose one):

Filename                          Quant    Size    Use Case
granite-docling-258M-Q4_K_M.gguf  Q4_K_M   133 MB  Smallest, good quality-size balance, recommended for most users
granite-docling-258M-Q5_K_M.gguf  Q5_K_M   139 MB  Better quality, still compact
granite-docling-258M-Q6_K.gguf    Q6_K     164 MB  Higher quality
granite-docling-258M-Q8_0.gguf    Q8_0     170 MB  Very high quality
granite-docling-258M-f16.gguf     F16      317 MB  Highest quality, original precision

Vision/Projector File (required):

Filename                              Size    Notes
mmproj-granite-docling-258M-f16.gguf  182 MB  Vision encoder (SigLIP) - required for all quantizations

Note: The mmproj file is kept at F16 precision to maintain vision quality.

Getting Started with llama.cpp

Prerequisites

  1. Build llama.cpp with multimodal support:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON  # For CUDA support
cmake --build build --config Release -j
  2. Download the GGUF files from this repository (see the download sketch below)
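
For example, the two required files can be fetched with the huggingface_hub Python library. This is a minimal sketch: the repository id is the one this card belongs to, and the filenames come from the tables above.

# Download one text model file plus the mmproj file (both are required for inference).
from huggingface_hub import hf_hub_download

REPO_ID = "infil00p/granite-docling-258M-GGUF"

model_path = hf_hub_download(REPO_ID, "granite-docling-258M-Q4_K_M.gguf")
mmproj_path = hf_hub_download(REPO_ID, "mmproj-granite-docling-258M-f16.gguf")

print(model_path)
print(mmproj_path)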

Basic Usage

# Using Q4_K_M (recommended for most users)
./llama.cpp/build/bin/llama-mtmd-cli \
  -m granite-docling-258M-Q4_K_M.gguf \
  --mmproj mmproj-granite-docling-258M-f16.gguf \
  --image document.png \
  --chat-template "{%- for message in messages -%}{{- '<|start_of_role|>' + message['role'] + '<|end_of_role|>' -}}{%- if message['content'] is string -%}{{- message['content'] -}}{%- else -%}{%- for part in message['content'] -%}{%- if part['type'] == 'text' -%}{{- part['text'] -}}{%- elif part['type'] == 'image' -%}{{- '<image>' -}}{%- endif -%}{%- endfor -%}{%- endif -%}{{- '<|end_of_text|>\n' -}}{%- endfor -%}{%- if add_generation_prompt -%}{{- '<|start_of_role|>assistant' -}}{%- if controls -%}{{- ' ' + controls | tojson() -}}{%- endif -%}{{- '<|end_of_role|>' -}}{%- endif -%}" \
  -p "Convert this page to docling." \
  -n 512 \
  --temp 0.1 \
  -ngl 99

# Or use any other quantization (Q5_K_M, Q6_K, Q8_0, F16)
# Just replace the -m parameter with your chosen model file

Simplified Usage (Save Chat Template)

Save the chat template to a file for easier reuse:

# Save chat template
cat > granite_docling_template.jinja << 'EOF'
{%- for message in messages -%}
{{- '<|start_of_role|>' + message['role'] + '<|end_of_role|>' -}}
{%- if message['content'] is string -%}
{{- message['content'] -}}
{%- else -%}
{%- for part in message['content'] -%}
{%- if part['type'] == 'text' -%}
{{- part['text'] -}}
{%- elif part['type'] == 'image' -%}
{{- '<image>' -}}
{%- endif -%}
{%- endfor -%}
{%- endif -%}
{{- '<|end_of_text|>
' -}}
{%- endfor -%}
{%- if add_generation_prompt -%}
{{- '<|start_of_role|>assistant' -}}
{%- if controls -%}{{- ' ' + controls | tojson() -}}{%- endif -%}
{{- '<|end_of_role|>' -}}
{%- endif -%}
EOF

# Then use it:
./llama.cpp/build/bin/llama-mtmd-cli \
  -m granite-docling-258M-Q4_K_M.gguf \
  --mmproj mmproj-granite-docling-258M-f16.gguf \
  --image document.png \
  --chat-template "$(cat granite_docling_template.jinja)" \
  -p "Convert this page to docling." \
  -n 512 \
  -ngl 99
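
If you want to drive the CLI from a script, a thin Python wrapper around the command above works well. This is a sketch, not part of the original tooling: the binary path, filenames, and template file are assumptions taken from the steps on this page.

# Minimal Python wrapper around the llama-mtmd-cli invocation shown above.
import subprocess

LLAMA_CLI = "./llama.cpp/build/bin/llama-mtmd-cli"   # built in the Prerequisites step
MODEL = "granite-docling-258M-Q4_K_M.gguf"           # any text model file from the table above
MMPROJ = "mmproj-granite-docling-258M-f16.gguf"
TEMPLATE_FILE = "granite_docling_template.jinja"     # saved in the step above

def convert_page(image_path, prompt="Convert this page to docling."):
    """Run the CLI on one image and return its stdout (the DocTags text)."""
    with open(TEMPLATE_FILE) as f:
        chat_template = f.read()
    cmd = [
        LLAMA_CLI,
        "-m", MODEL,
        "--mmproj", MMPROJ,
        "--image", image_path,
        "--chat-template", chat_template,
        "-p", prompt,
        "-n", "512",
        "--temp", "0.1",
        "-ngl", "99",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout

if __name__ == "__main__":
    print(convert_page("document.png"))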

Choosing a Quantization

Recommended for most users: Q4_K_M - Best balance of size and quality

Quantization  Total Size  Quality    Speed    RAM Usage
Q4_K_M        315 MB      Good       Fastest  ~400 MB
Q5_K_M        321 MB      Better     Fast     ~420 MB
Q6_K          346 MB      High       Medium   ~450 MB
Q8_0          352 MB      Very High  Medium   ~480 MB
F16           499 MB      Highest    Slower   ~650 MB

Total size = text model + mmproj (182 MB)

Example Output

The model outputs DocTags format with precise layout information:

<doctag>
<page_header><loc_145><loc_28><loc_355><loc_35>ENERGY BUDGET OF WASP-121 b</page_header>
<text><loc_88><loc_42><loc_242><loc_89>while the kernel weights are structured as...</text>
...
</doctag>
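
The DocTags output can be post-processed with a short script. Below is a minimal parsing sketch that assumes elements follow the pattern shown above (an element tag, four <loc_*> tokens, then the text); how the location values map to pixel coordinates is not specified here and is left to the caller.

# Minimal DocTags parser for output like the example above.
import re

ELEMENT_RE = re.compile(
    r"<(?P<tag>[a-z_]+)>"                     # element tag, e.g. text, page_header
    r"<loc_(?P<x0>\d+)><loc_(?P<y0>\d+)>"     # first two location tokens
    r"<loc_(?P<x1>\d+)><loc_(?P<y1>\d+)>"     # last two location tokens
    r"(?P<body>.*?)"                          # element content
    r"</(?P=tag)>",
    re.DOTALL,
)

def parse_doctags(doctags):
    """Return a list of (tag, (x0, y0, x1, y1), text) entries."""
    elements = []
    for m in ELEMENT_RE.finditer(doctags):
        bbox = tuple(int(m.group(k)) for k in ("x0", "y0", "x1", "y1"))
        elements.append((m.group("tag"), bbox, m.group("body").strip()))
    return elements

# Example:
# for tag, bbox, text in parse_doctags(open("output.txt").read()):
#     print(tag, bbox, text[:60])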

Supported Instructions

Description      Instruction
Full conversion  Convert this page to docling.
Chart            Convert chart to table.
Formula          Convert formula to LaTeX.
Code             Convert code to text.
Table            Convert table to OTSL.
OCR region       OCR the text in a specific location: <loc_155><loc_233><loc_206><loc_237>
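
For scripting, these instructions can be kept in a small lookup and combined with the convert_page helper sketched earlier (the key names here are illustrative, not part of the model's interface).

# Task name -> prompt mapping taken from the table above.
DOCLING_PROMPTS = {
    "full_page": "Convert this page to docling.",
    "chart":     "Convert chart to table.",
    "formula":   "Convert formula to LaTeX.",
    "code":      "Convert code to text.",
    "table":     "Convert table to OTSL.",
}

# Example: extract only the tables from a scanned page.
# doctags = convert_page("document.png", prompt=DOCLING_PROMPTS["table"])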

Performance

Tested on NVIDIA RTX 4070 Ti SUPER with CUDA:

  • Image encoding: ~6-8ms per slice (17 slices total for 512x512 images)
  • Prompt processing: ~1305 tokens/sec
  • Generation speed: ~706 tokens/sec
  • Total memory: ~600 MB GPU (with all layers offloaded)

Model Architecture

The architecture consists of:

  1. Vision encoder: siglip2-base-patch16-512

    • 768 hidden dimensions, 12 layers, 12 heads
    • 512x512 image input with 16x16 patches
    • Image splitting: 4x4 grid + global view = 17 frames
  2. Vision-language connector: Pixel shuffle projector (Idefics3 style)

    • Projects vision features to LLM embedding space
    • Scale factor: 4
  3. Large language model: Granite 165M

    • 30 layers, 9 attention heads, 3 KV heads
    • 576 hidden dimensions
    • Context length: 8192 tokens
    • Vocabulary: 100,352 tokens (GPT2 tokenizer with DocTags extensions)
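
A back-of-the-envelope count of the visual tokens follows from the numbers above. This is a sketch of the usual Idefics3-style accounting (pixel shuffle reduces the token count by the square of the scale factor); it is not stated on the original card.

# Rough visual-token budget per image, derived from the architecture numbers above.
image_size = 512          # input resolution per frame
patch_size = 16           # SigLIP patch size
scale_factor = 4          # pixel-shuffle scale (reduces tokens by scale_factor**2)
frames = 17               # 4x4 grid + 1 global view

patches_per_frame = (image_size // patch_size) ** 2          # 32 * 32 = 1024
tokens_per_frame = patches_per_frame // scale_factor ** 2    # 1024 / 16 = 64
total_visual_tokens = frames * tokens_per_frame              # 17 * 64 = 1088

print(patches_per_frame, tokens_per_frame, total_visual_tokens)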

Conversion Details

These GGUF files were converted from the original model using llama.cpp's conversion tools:

  1. Vision/Projector conversion (F16):

    python convert_hf_to_gguf.py granite-docling-258M \
      --mmproj --outtype f16 \
      --outfile mmproj-granite-docling-258M-f16.gguf
    
  2. Text model conversion:

    • Extracted text model weights from the full VLM model
    • Converted to GGUF with F16 precision
    • Preserved all special tokens and tokenizer configuration
  3. Quantization:

    llama-quantize granite-docling-258M-f16.gguf granite-docling-258M-Q4_K_M.gguf Q4_K_M
    llama-quantize granite-docling-258M-f16.gguf granite-docling-258M-Q5_K_M.gguf Q5_K_M
    llama-quantize granite-docling-258M-f16.gguf granite-docling-258M-Q6_K.gguf Q6_K
    llama-quantize granite-docling-258M-f16.gguf granite-docling-258M-Q8_0.gguf Q8_0
    

Use Cases

Granite-Docling excels at:

  • πŸ“„ Document OCR: Extract text from scanned documents with layout preservation
  • πŸ“Š Table Recognition: Convert tables to structured formats (OTSL/HTML)
  • πŸ”’ Equation Recognition: Extract LaTeX from mathematical formulas
  • πŸ’» Code Recognition: Extract code snippets from documents
  • πŸ“ˆ Chart-to-Table: Convert charts and graphs to structured data
  • πŸ—‚οΈ Layout Analysis: Understand document structure (headers, footers, sections)

Limitations

  • Not for general image understanding: For general vision tasks, use Granite Vision models
  • Document-focused: Optimized for document pages, not natural images
  • English primary: Best performance on English documents (experimental support for Japanese, Arabic, Chinese)
  • Potential hallucination: Like all smaller VLMs, may hallucinate in complex scenarios

Responsible Use

This model is designed for document understanding and should be used responsibly:

  • Verify outputs for critical applications
  • Be aware of potential biases in document interpretation
  • Do not use for autonomous decision-making without human oversight
  • Consider using with Granite Guardian for additional safety

Citation

If you use this model, please cite:

@misc{granite-docling-2025,
  title={Granite Docling: Efficient Document Conversion with Vision-Language Models},
  author={IBM Research},
  year={2025},
  url={https://huggingface.co/ibm-granite/granite-docling-258M}
}

Resources

  • Original model: https://huggingface.co/ibm-granite/granite-docling-258M
  • llama.cpp: https://github.com/ggerganov/llama.cpp

Acknowledgments

  • Original model by IBM Research
  • GGUF conversion using llama.cpp conversion tools
  • Thanks to the llama.cpp team for multimodal support