
ONNX Model Conversion Notes

First of all, this was rather fun to do!

I realized it might not be obvious that the convert script I made originally applied only to image inputs, and that you would do almost exactly the same for text inputs. So I have extended it a tiny bit and left this comment. This matters because the intended use of ColPali is to run image embedding offline (getting your vector DB ready), while the text model is intended for online (query) time.

I have now included that in the convert.py script as well; it is not much of a change. I've skipped uploading the text model files, since the process is exactly the same as for the vision model, so the results will match, and uploading takes a long time on my home Wi-Fi unfortunately.

I've opted for two models. In theory you could split the image and text inputs into several graphs and call them in the correct order, since they do share (some) weights for each input type. However, given ColPali's intended offline/online use case, that is not necessary and probably overkill for this exercise.
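To make the offline/online split concrete, here is a minimal sketch (plain NumPy; the function name and shapes are my own, not from convert.py) of the late-interaction scoring ColPali uses at query time: image token embeddings computed offline are scored against query token embeddings with MaxSim.

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """ColBERT/ColPali-style late interaction: for each query token,
    take the max similarity over all document (image patch) tokens,
    then sum over query tokens."""
    # (n_query_tokens, n_doc_tokens) similarity matrix
    sim = query_emb @ doc_emb.T
    return float(sim.max(axis=1).sum())

# Offline: multi-vector image embeddings stored in your vector DB (toy data here)
rng = np.random.default_rng(0)
doc_emb = rng.standard_normal((1030, 128)).astype(np.float32)

# Online: embed the text query with the text model, then score each stored doc
query_emb = rng.standard_normal((16, 128)).astype(np.float32)
score = maxsim_score(query_emb, doc_emb)
```

In practice `doc_emb` would come from the vision ONNX model and `query_emb` from the text one, which is exactly why they can live in two separate graphs.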

Some practical notes

The convert.py script is based on code I wrote in Google Colab in order to have access to a GPU. The requirements.txt might not be perfect; I would much rather use uv, which I use daily, but this was put together quickly in Colab.

Also note that I checked the output of the converted models and the original to compare:

  • The FP32 (default ONNX) output is nearly identical to the original HF model's.
  • The FP16 converted ONNX model, however, is not exactly the same; there is a small margin of error.
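The FP16 gap is expected: half precision keeps only ~11 bits of mantissa, and the error accumulates through every layer of the forward pass. A quick NumPy illustration of the effect (a single round-trip cast, not the actual model):

```python
import numpy as np

rng = np.random.default_rng(42)
emb = rng.standard_normal(128).astype(np.float32)

# Simulate an FP16 pipeline: cast the vector down to fp16 and back up
emb_fp16 = emb.astype(np.float16).astype(np.float32)

# Cosine similarity between the original and the round-tripped vector
cos = float(
    np.dot(emb, emb_fp16)
    / (np.linalg.norm(emb) * np.linalg.norm(emb_fp16))
)
# A single cast stays very close to 1.0; a full FP16 forward pass
# repeats this loss at every layer, which is why the converted
# model lands noticeably lower (~0.94 in my comparison below).
print(f"cosine after fp16 round-trip: {cos:.6f}")
```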

Below is a code snippet that showcases the comparison for image input:

import torch
import numpy as np
import onnxruntime as ort
from PIL import Image
from transformers import ColPaliForRetrieval, ColPaliProcessor

MODEL_ID  = "vidore/colpali-v1.3-hf"
# change this to the FP16 version or FP32 version
ONNX_PATH = "/content/final_colpali/model.onnx"
DEVICE    = "cpu"

hf = (
    ColPaliForRetrieval
    # NOTE: use torch.float32 when comparing against the FP32 ONNX model,
    # and torch.float16 when comparing against the FP16 ONNX model.
      .from_pretrained(MODEL_ID, torch_dtype=torch.float32)
      .to(DEVICE)
      .eval()
)
processor = ColPaliProcessor.from_pretrained(MODEL_ID)

img = Image.new("RGB", (32, 32), color="white")
inputs = processor(images=[img], return_tensors="pt").to(DEVICE)

with torch.no_grad():
    out = hf(**inputs)
hf_emb = out.embeddings.cpu().numpy()

sess = ort.InferenceSession(ONNX_PATH, providers=["CPUExecutionProvider"])
ort_inputs = {k: v.cpu().numpy() for k, v in inputs.items()}
[onnx_emb] = sess.run(["embeddings"], ort_inputs)
# Per-token cosine similarity between the HF and ONNX embeddings.
# The minimum will be ~0.999 for FP32 and ~0.939 for FP16.
dot   = np.sum(hf_emb * onnx_emb, axis=-1)
norms = np.linalg.norm(hf_emb, axis=-1) * np.linalg.norm(onnx_emb, axis=-1)
cosim = dot / norms
print(cosim.min())