granite-docling-258M-GGUF

This repository contains GGUF format quantized versions of ibm-granite/granite-docling-258M for use with llama.cpp.

Granite Docling is a multimodal Image-Text-to-Text model engineered for efficient document conversion. This GGUF version enables fast CPU and GPU inference using llama.cpp, making it ideal for edge deployment and resource-constrained environments.

Model Summary

  • Original Model: ibm-granite/granite-docling-258M
  • Developed by: IBM Research
  • Model type: Multi-modal model (image+text-to-text)
  • Architecture: Idefics3 (SigLIP vision encoder + Granite 165M LLM)
  • Language(s): English
  • License: Apache 2.0
  • Format: GGUF (for llama.cpp)
  • Quantizations: F16, Q8_0, Q6_K, Q5_K_M, Q4_K_M

Files Included

This repository contains multiple quantization levels of the text model, plus the vision encoder/projector. You need one text model file + the mmproj file for inference.

Text Model Files (choose one):

Filename                          Quant    Size    Use Case
granite-docling-258M-Q4_K_M.gguf  Q4_K_M   133 MB  Smallest, good quality-size balance, recommended for most users
granite-docling-258M-Q5_K_M.gguf  Q5_K_M   139 MB  Better quality, still compact
granite-docling-258M-Q6_K.gguf    Q6_K     164 MB  Higher quality
granite-docling-258M-Q8_0.gguf    Q8_0     170 MB  Very high quality
granite-docling-258M-f16.gguf     F16      317 MB  Highest quality, original precision

Vision/Projector File (required):

Filename                              Size    Notes
mmproj-granite-docling-258M-f16.gguf  182 MB  Vision encoder (SigLIP) - required for all quantizations

Note: The mmproj file is kept at F16 precision to maintain vision quality.

Getting Started with llama.cpp

Prerequisites

  1. Build llama.cpp with multimodal support:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON  # For CUDA support
cmake --build build --config Release -j
  2. Download the GGUF files from this repository (see the download sketch below)
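
For example, the two required files can be fetched with the huggingface_hub Python library. This is a minimal sketch: the repository id is the one this card belongs to, and the filenames come from the tables above.

# Download one text model file plus the mmproj file (both are required for inference).
from huggingface_hub import hf_hub_download

REPO_ID = "infil00p/granite-docling-258M-GGUF"

model_path = hf_hub_download(REPO_ID, "granite-docling-258M-Q4_K_M.gguf")
mmproj_path = hf_hub_download(REPO_ID, "mmproj-granite-docling-258M-f16.gguf")

print(model_path)
print(mmproj_path)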

Basic Usage

# Using Q4_K_M (recommended for most users)
./llama.cpp/build/bin/llama-mtmd-cli \
  -m granite-docling-258M-Q4_K_M.gguf \
  --mmproj mmproj-granite-docling-258M-f16.gguf \
  --image document.png \
  --chat-template "{%- for message in messages -%}{{- '<|start_of_role|>' + message['role'] + '<|end_of_role|>' -}}{%- if message['content'] is string -%}{{- message['content'] -}}{%- else -%}{%- for part in message['content'] -%}{%- if part['type'] == 'text' -%}{{- part['text'] -}}{%- elif part['type'] == 'image' -%}{{- '<image>' -}}{%- endif -%}{%- endfor -%}{%- endif -%}{{- '<|end_of_text|>\n' -}}{%- endfor -%}{%- if add_generation_prompt -%}{{- '<|start_of_role|>assistant' -}}{%- if controls -%}{{- ' ' + controls | tojson() -}}{%- endif -%}{{- '<|end_of_role|>' -}}{%- endif -%}" \
  -p "Convert this page to docling." \
  -n 512 \
  --temp 0.1 \
  -ngl 99

# Or use any other quantization (Q5_K_M, Q6_K, Q8_0, F16)
# Just replace the -m parameter with your chosen model file

Simplified Usage (Save Chat Template)

Save the chat template to a file for easier reuse:

# Save chat template
cat > granite_docling_template.jinja << 'EOF'
{%- for message in messages -%}
{{- '<|start_of_role|>' + message['role'] + '<|end_of_role|>' -}}
{%- if message['content'] is string -%}
{{- message['content'] -}}
{%- else -%}
{%- for part in message['content'] -%}
{%- if part['type'] == 'text' -%}
{{- part['text'] -}}
{%- elif part['type'] == 'image' -%}
{{- '<image>' -}}
{%- endif -%}
{%- endfor -%}
{%- endif -%}
{{- '<|end_of_text|>
' -}}
{%- endfor -%}
{%- if add_generation_prompt -%}
{{- '<|start_of_role|>assistant' -}}
{%- if controls -%}{{- ' ' + controls | tojson() -}}{%- endif -%}
{{- '<|end_of_role|>' -}}
{%- endif -%}
EOF

# Then use it:
./llama.cpp/build/bin/llama-mtmd-cli \
  -m granite-docling-258M-Q4_K_M.gguf \
  --mmproj mmproj-granite-docling-258M-f16.gguf \
  --image document.png \
  --chat-template "$(cat granite_docling_template.jinja)" \
  -p "Convert this page to docling." \
  -n 512 \
  -ngl 99
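
If you want to drive the CLI from a script, a thin Python wrapper around the command above works well. This is a sketch, not part of the original tooling: the binary path, filenames, and template file are assumptions taken from the steps on this page.

# Minimal Python wrapper around the llama-mtmd-cli invocation shown above.
import subprocess

LLAMA_CLI = "./llama.cpp/build/bin/llama-mtmd-cli"   # built in the Prerequisites step
MODEL = "granite-docling-258M-Q4_K_M.gguf"           # any text model file from the table above
MMPROJ = "mmproj-granite-docling-258M-f16.gguf"
TEMPLATE_FILE = "granite_docling_template.jinja"     # saved in the step above

def convert_page(image_path, prompt="Convert this page to docling."):
    """Run the CLI on one image and return its stdout (the DocTags text)."""
    with open(TEMPLATE_FILE) as f:
        chat_template = f.read()
    cmd = [
        LLAMA_CLI,
        "-m", MODEL,
        "--mmproj", MMPROJ,
        "--image", image_path,
        "--chat-template", chat_template,
        "-p", prompt,
        "-n", "512",
        "--temp", "0.1",
        "-ngl", "99",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout

if __name__ == "__main__":
    print(convert_page("document.png"))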

Choosing a Quantization

Recommended for most users: Q4_K_M - Best balance of size and quality

Quantization  Total Size  Quality    Speed    RAM Usage
Q4_K_M        315 MB      Good       Fastest  ~400 MB
Q5_K_M        321 MB      Better     Fast     ~420 MB
Q6_K          346 MB      High       Medium   ~450 MB
Q8_0          352 MB      Very High  Medium   ~480 MB
F16           499 MB      Highest    Slower   ~650 MB

Total size = text model + mmproj (182 MB)

Example Output

The model outputs DocTags format with precise layout information:

<doctag>
<page_header><loc_145><loc_28><loc_355><loc_35>ENERGY BUDGET OF WASP-121 b</page_header>
<text><loc_88><loc_42><loc_242><loc_89>while the kernel weights are structured as...</text>
...
</doctag>
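
The DocTags output can be post-processed with a short script. Below is a minimal parsing sketch that assumes elements follow the pattern shown above (an element tag, four <loc_*> tokens, then the text); how the location values map to pixel coordinates is not specified here and is left to the caller.

# Minimal DocTags parser for output like the example above.
import re

ELEMENT_RE = re.compile(
    r"<(?P<tag>[a-z_]+)>"                     # element tag, e.g. text, page_header
    r"<loc_(?P<x0>\d+)><loc_(?P<y0>\d+)>"     # first two location tokens
    r"<loc_(?P<x1>\d+)><loc_(?P<y1>\d+)>"     # last two location tokens
    r"(?P<body>.*?)"                          # element content
    r"</(?P=tag)>",
    re.DOTALL,
)

def parse_doctags(doctags):
    """Return a list of (tag, (x0, y0, x1, y1), text) entries."""
    elements = []
    for m in ELEMENT_RE.finditer(doctags):
        bbox = tuple(int(m.group(k)) for k in ("x0", "y0", "x1", "y1"))
        elements.append((m.group("tag"), bbox, m.group("body").strip()))
    return elements

# Example:
# for tag, bbox, text in parse_doctags(open("output.txt").read()):
#     print(tag, bbox, text[:60])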

Supported Instructions

Description      Instruction
Full conversion  Convert this page to docling.
Chart            Convert chart to table.
Formula          Convert formula to LaTeX.
Code             Convert code to text.
Table            Convert table to OTSL.
OCR region       OCR the text in a specific location: <loc_155><loc_233><loc_206><loc_237>
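
For scripting, these instructions can be kept in a small lookup and combined with the convert_page helper sketched earlier (the key names here are illustrative, not part of the model's interface).

# Task name -> prompt mapping taken from the table above.
DOCLING_PROMPTS = {
    "full_page": "Convert this page to docling.",
    "chart":     "Convert chart to table.",
    "formula":   "Convert formula to LaTeX.",
    "code":      "Convert code to text.",
    "table":     "Convert table to OTSL.",
}

# Example: extract only the tables from a scanned page.
# doctags = convert_page("document.png", prompt=DOCLING_PROMPTS["table"])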

Performance

Tested on NVIDIA RTX 4070 Ti SUPER with CUDA:

  • Image encoding: ~6-8ms per slice (17 slices total for 512x512 images)
  • Prompt processing: ~1305 tokens/sec
  • Generation speed: ~706 tokens/sec
  • Total memory: ~600 MB GPU (with all layers offloaded)

Model Architecture

The architecture consists of:

  1. Vision encoder: siglip2-base-patch16-512

    • 768 hidden dimensions, 12 layers, 12 heads
    • 512x512 image input with 16x16 patches
    • Image splitting: 4x4 grid + global view = 17 frames
  2. Vision-language connector: Pixel shuffle projector (Idefics3 style)

    • Projects vision features to LLM embedding space
    • Scale factor: 4
  3. Large language model: Granite 165M

    • 30 layers, 9 attention heads, 3 KV heads
    • 576 hidden dimensions
    • Context length: 8192 tokens
    • Vocabulary: 100,352 tokens (GPT2 tokenizer with DocTags extensions)
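
A back-of-the-envelope count of the visual tokens follows from the numbers above. This is a sketch of the usual Idefics3-style accounting (pixel shuffle reduces the token count by the square of the scale factor); it is not stated on the original card.

# Rough visual-token budget per image, derived from the architecture numbers above.
image_size = 512          # input resolution per frame
patch_size = 16           # SigLIP patch size
scale_factor = 4          # pixel-shuffle scale (reduces tokens by scale_factor**2)
frames = 17               # 4x4 grid + 1 global view

patches_per_frame = (image_size // patch_size) ** 2          # 32 * 32 = 1024
tokens_per_frame = patches_per_frame // scale_factor ** 2    # 1024 / 16 = 64
total_visual_tokens = frames * tokens_per_frame              # 17 * 64 = 1088

print(patches_per_frame, tokens_per_frame, total_visual_tokens)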

Conversion Details

These GGUF files were converted from the original model using llama.cpp's conversion tools:

  1. Vision/Projector conversion (F16):

    python convert_hf_to_gguf.py granite-docling-258M \
      --mmproj --outtype f16 \
      --outfile mmproj-granite-docling-258M-f16.gguf
    
  2. Text model conversion:

    • Extracted text model weights from the full VLM model
    • Converted to GGUF with F16 precision
    • Preserved all special tokens and tokenizer configuration
  3. Quantization:

    llama-quantize granite-docling-258M-f16.gguf granite-docling-258M-Q4_K_M.gguf Q4_K_M
    llama-quantize granite-docling-258M-f16.gguf granite-docling-258M-Q5_K_M.gguf Q5_K_M
    llama-quantize granite-docling-258M-f16.gguf granite-docling-258M-Q6_K.gguf Q6_K
    llama-quantize granite-docling-258M-f16.gguf granite-docling-258M-Q8_0.gguf Q8_0
    

Use Cases

Granite-Docling excels at:

  • πŸ“„ Document OCR: Extract text from scanned documents with layout preservation
  • πŸ“Š Table Recognition: Convert tables to structured formats (OTSL/HTML)
  • πŸ”’ Equation Recognition: Extract LaTeX from mathematical formulas
  • πŸ’» Code Recognition: Extract code snippets from documents
  • πŸ“ˆ Chart-to-Table: Convert charts and graphs to structured data
  • πŸ—‚οΈ Layout Analysis: Understand document structure (headers, footers, sections)

Limitations

  • Not for general image understanding: For general vision tasks, use Granite Vision models
  • Document-focused: Optimized for document pages, not natural images
  • English primary: Best performance on English documents (experimental support for Japanese, Arabic, Chinese)
  • Potential hallucination: Like all smaller VLMs, may hallucinate in complex scenarios

Responsible Use

This model is designed for document understanding and should be used responsibly:

  • Verify outputs for critical applications
  • Be aware of potential biases in document interpretation
  • Do not use for autonomous decision-making without human oversight
  • Consider using with Granite Guardian for additional safety

Citation

If you use this model, please cite:

@misc{granite-docling-2025,
  title={Granite Docling: Efficient Document Conversion with Vision-Language Models},
  author={IBM Research},
  year={2025},
  url={https://huggingface.co/ibm-granite/granite-docling-258M}
}

Resources

  • Original model: https://huggingface.co/ibm-granite/granite-docling-258M
  • llama.cpp: https://github.com/ggerganov/llama.cpp

Acknowledgments

  • Original model by IBM Research
  • GGUF conversion using llama.cpp conversion tools
  • Thanks to the llama.cpp team for multimodal support