---
license: apache-2.0
license_name: apache-2.0
license_link: https://www.apache.org/licenses/LICENSE-2.0
tags:
- text
- image
- video
- multimodal-embedding
- vidore
- colpali
- colqwen3
- multilingual-embedding
language:
- multilingual
library_name: transformers
pipeline_tag: visual-document-retrieval
base_model:
- Qwen/Qwen3-VL-8B-Instruct
---
# TomoroAI/tomoro-colqwen3-embed-8b
## ⚡ Executive Summary
**TomoroAI/tomoro-colqwen3-embed-8b** is a state-of-the-art [ColPali](https://arxiv.org/abs/2407.01449)-style multimodal embedding model. It maps text queries, visual documents (images, PDFs) or short videos into aligned multi-vector embeddings.
Built by merging **[Qwen/Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct)** with **[Qwen/Qwen3-Embedding-8B](https://huggingface.co/Qwen/Qwen3-Embedding-8B)**, this model inherits robust text retrieval capabilities while preserving a full vision stack. It has been fine-tuned on a curated mixture of [VDR](https://huggingface.co/datasets/vdr-multilingual-train), [ViDoRe-ColPali-Training](https://huggingface.co/datasets/vidore/colpali_train_set), [VisRAG-Ret-Train-Synthetic-data](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Train-Synthetic-data), and [VisRAG-Ret-Train-In-domain-data](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Train-In-domain-data). It achieves SOTA or competitive performance across **ViDoRe V1-V3** (English and Multilingual) while offering a significantly reduced embedding footprint compared to other full-dimension ColPali-style alternatives.
## 🛠️ Model Specifications
| Feature | Detail |
| :--- | :--- |
| **Architecture** | Qwen3-VL 8B (Encoder-only variant) + 320-dim Projection Head |
| **Methodology** | ColPali-style Late Interaction (MaxSim scoring) |
| **Token Budget** | Up to 1,280 visual tokens per page or 5,120 visual tokens per video (text prompts constrained only by the base context window) |
| **Context Window** | 32k (inherited from base), typical usage < 2k tokens |
| **Output** | Multi-vector (Seq_Len × 320), L2-normalized |
| **Supported Modalities** | Text Queries, RGB Images, Synthetic Documents, Short Video (Frame-wise) |
| **Precision** | `bfloat16` weights, FlashAttention 2 enabled |
### Key Properties
* **Merged Encoders:** Combines the Qwen3-VL vision encoder (patch-grid tokens with spatial merge) and language encoder.
* **Projection:** A custom 320-dim head projects every token (text or visual) into a vector.
* **Processing:**
* **Queries:** Left-padded text sequences.
* **Documents:** Rendered with a lightweight vision prompt and flattened into image tokens.
* **Video:** Supports video retrieval by decoding videos into frames and processing via the vision stack (generalization capability, not explicitly fine-tuned; dedicated benchmark coming soon).
* **Storage Efficiency:**
* *Baseline (NVIDIA Nemo-3B):* Stores 1,802 tokens @ 3,072 dims (≈10.3 TB for 1M images).
* *Tomoro ColQwen3:* Stores max 1,280 tokens @ 320 dims (**≈0.82 TB for 1M images**).
* **Result:** **13× smaller footprint** with higher performance.
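For clarity, the late-interaction (MaxSim) scoring referenced above can be sketched in a few lines. This is an illustrative helper only (`maxsim_score` is our name, not part of the released API; use `processor.score_multi_vector` in practice), assuming L2-normalized multi-vector embeddings as produced by this model:
```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """query_emb: (num_query_tokens, 320); doc_emb: (num_doc_tokens, 320), both L2-normalized."""
    sim = query_emb @ doc_emb.T            # cosine similarity between every query/document token pair
    return sim.max(dim=-1).values.sum()    # best-matching document token per query token, summed

# Illustrative footprint arithmetic (bfloat16 = 2 bytes per value):
# 1,280 tokens x 320 dims x 2 bytes ≈ 0.82 MB per page → ≈ 0.82 TB for 1M pages.
```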
---
## 📊 Evaluation Results
We report results on the **ViDoRe** benchmark suite. The model sets new standards on the multilingual and English splits of ViDoRe V2 and V3 while maintaining comparably high performance on ViDoRe V1.
### ViDoRe V3 (Latest)
**English nDCG@5**
| Model | CompSci | Energy | FinanceEn | FinanceFr | HR | Industrial | Pharma | Physics | **Avg** |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **[tomoro-colqwen3-8b](https://huggingface.co/TomoroAI/tomoro-colqwen3-embed-8b)** | 0.7443 | **0.6491** | **0.6823** | **0.4546** | **0.6421** | 0.5766 | **0.6665** | **0.4747** | **0.6113** |
| [tomoro-colqwen3-4b](https://huggingface.co/TomoroAI/tomoro-colqwen3-embed-4b) | 0.7419 | 0.6023 | 0.6753 | 0.4202 | 0.6037 | **0.5787** | 0.6612 | 0.4640 | 0.5934 |
| [nemo-colembed-3b](https://huggingface.co/nvidia/llama-nemoretriever-colembed-3b-v1) | 0.7514 | 0.5838 | 0.6712 | 0.3730 | 0.6256 | 0.5447 | 0.6524 | 0.4128 | 0.5769 |
| [jinaai/jina-embeddings-v4](https://huggingface.co/jinaai/jina-embeddings-v4) | 0.7175 | 0.5842 | 0.6417 | 0.3859 | 0.6206 | 0.5443 | 0.6303 | 0.4191 | 0.5680 |
| [nomic-ai/colnomic-embed-multimodal-7b](https://huggingface.co/nomic-ai/colnomic-embed-multimodal-7b) | **0.7528** | 0.5824 | 0.6041 | 0.3877 | 0.6060 | 0.5229 | 0.6226 | 0.4423 | 0.5651 |
**Multilingual nDCG@5** (Excluding English Subsets)
| Model | CompSci | Energy | FinanceEn | FinanceFr | HR | Industrial | Pharma | Physics | **Avg** |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **[tomoro-colqwen3-8b](https://huggingface.co/TomoroAI/tomoro-colqwen3-embed-8b)** | 0.7194 | **0.6619** | **0.6172** | **0.4570** | **0.6097** | **0.5164** | **0.6403** | **0.4706** | **0.5866** |
| [tomoro-colqwen3-4b](https://huggingface.co/TomoroAI/tomoro-colqwen3-embed-4b) | 0.7213 | 0.6374 | 0.6019 | 0.4305 | 0.5637 | 0.5131 | 0.6351 | 0.4636 | 0.5708 |
| [nemo-colembed-3b](https://huggingface.co/nvidia/llama-nemoretriever-colembed-3b-v1) | 0.7216 | 0.5901 | 0.5646 | 0.4102 | 0.5504 | 0.4335 | 0.6170 | 0.4192 | 0.5383 |
| [jinaai/jina-embeddings-v4](https://huggingface.co/jinaai/jina-embeddings-v4) | 0.6843 | 0.6036 | 0.5482 | 0.4249 | 0.5542 | 0.4732 | 0.6059 | 0.4381 | 0.5416 |
| [nomic-ai/colnomic-embed-multimodal-7b](https://huggingface.co/nomic-ai/colnomic-embed-multimodal-7b) | **0.7333** | 0.6160 | 0.5219 | 0.4169 | 0.5494 | 0.4764 | 0.5938 | 0.4449 | 0.5441 |
### ViDoRe V2
**English nDCG@5**
| Model | BioMed | ESG HL | ESG Rpts | Economics | **Avg** |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **[tomoro-colqwen3-8b](https://huggingface.co/TomoroAI/tomoro-colqwen3-embed-8b)** | **0.6784** | **0.7598** | **0.6549** | 0.6159 | **0.6772** |
| [tomoro-colqwen3-4b](https://huggingface.co/TomoroAI/tomoro-colqwen3-embed-4b) | 0.6718 | 0.7465 | 0.6300 | 0.5910 | 0.6598 |
| [nemo-colembed-3b](https://huggingface.co/nvidia/llama-nemoretriever-colembed-3b-v1) | 0.6518 | 0.7538 | 0.6030 | **0.6619** | 0.6676 |
| [jinaai/jina-embeddings-v4](https://huggingface.co/jinaai/jina-embeddings-v4) | 0.6359 | 0.6512 | 0.5194 | 0.5955 | 0.6005 |
| [nomic-ai/colnomic-embed-multimodal-7b](https://huggingface.co/nomic-ai/colnomic-embed-multimodal-7b) | 0.6479 | 0.6871 | 0.5498 | 0.5955 | 0.6201 |
**Multilingual nDCG@5**
| Model | BioMed | ESG Rpts | Economics | **Avg** |
| :--- | :--- | :--- | :--- | :--- |
| **[tomoro-colqwen3-8b](https://huggingface.co/TomoroAI/tomoro-colqwen3-embed-8b)** | 0.6467 | 0.5911 | **0.5875** | **0.6085** |
| [tomoro-colqwen3-4b](https://huggingface.co/TomoroAI/tomoro-colqwen3-embed-4b) | **0.6478** | **0.6226** | 0.5536 | 0.6080 |
| [nemo-colembed-3b](https://huggingface.co/nvidia/llama-nemoretriever-colembed-3b-v1) | 0.6187 | 0.5640 | 0.5506 | 0.5778 |
| [jinaai/jina-embeddings-v4](https://huggingface.co/jinaai/jina-embeddings-v4) | 0.5994 | 0.5178 | 0.5364 | 0.5512 |
| [nomic-ai/colnomic-embed-multimodal-7b](https://huggingface.co/nomic-ai/colnomic-embed-multimodal-7b) | 0.6224 | 0.5336 | 0.5433 | 0.5664 |
### ViDoRe V1 (English nDCG@5)
| Model | ArxivQA | DocVQA | InfoVQA | Shift | Syn-AI | Syn-Eng | Syn-Gov | Syn-Health | TabFQuAD | Tatdqa | **Avg** |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **[tomoro-colqwen3-8b](https://huggingface.co/TomoroAI/tomoro-colqwen3-embed-8b)** | **0.9115** | **0.6637** | 0.9448 | 0.8789 | 0.9926 | 0.9671 | **0.9758** | 0.9906 | 0.9423 | 0.8092 | 0.9076 |
| [tomoro-colqwen3-4b](https://huggingface.co/TomoroAI/tomoro-colqwen3-embed-4b) | 0.9066 | 0.6624 | 0.9429 | 0.8739 | 0.9926 | 0.9691 | 0.9717 | **0.9963** | 0.9433 | 0.7983 | 0.9057 |
| [nemo-colembed-3b](https://huggingface.co/nvidia/llama-nemoretriever-colembed-3b-v1) | 0.8835 | 0.6621 | **0.9492** | 0.9070 | **0.9963** | 0.9663 | 0.9782 | 0.9926 | 0.9594 | 0.8057 | **0.9100** |
| [jinaai/jina-embeddings-v4](https://huggingface.co/jinaai/jina-embeddings-v4) | 0.8846 | 0.6014 | 0.9379 | **0.9293** | 0.9926 | **0.9726** | 0.9659 | 0.9913 | 0.9560 | 0.8035 | 0.9035 |
| [nomic-ai/colnomic-embed-multimodal-7b](https://huggingface.co/nomic-ai/colnomic-embed-multimodal-7b) | 0.8832 | 0.6011 | 0.9221 | 0.8930 | 0.9876 | 0.9626 | 0.9592 | 0.9926 | **0.9596** | **0.8108** | 0.8972 |
### Video Retrieval Evaluation
To demonstrate that Tomoro ColQwen3 generalizes strongly to video retrieval, we evaluated the models on the [CareBench](https://carebench.github.io/) text-to-video (General Retrieval) task and the [MMEB-V2](https://huggingface.co/datasets/TIGER-Lab/MMEB-V2) video_ret benchmark.
#### CareBench Evaluation
For this evaluation, we utilized a **raw video encoding** approach: our models encoded the video files directly without any additional textual annotations or metadata inputs. This highlights the model's ability to perform retrieval based purely on visual semantics.
| Model | Recall@1 | Recall@5 | Recall@10 |
| :--- | :--- | :--- | :--- |
| **[tomoro-colqwen3-8b](https://huggingface.co/TomoroAI/tomoro-colqwen3-embed-8b)** | **0.8670** | **0.9590** | 0.9850 |
| [tomoro-colqwen3-4b](https://huggingface.co/TomoroAI/tomoro-colqwen3-embed-4b) | 0.8620 | 0.9570 | 0.9800 |
| [Care7B](https://huggingface.co/MCG-NJU/CaRe-7B) | 0.7700 | 0.9560 | **0.9870** |
#### MMEB-V2 video_ret Evaluation
All evaluations below use the Hit@1 metric.
| Model | MSR-VTT | MSVD | DiDeMo | VATEX | YouCook2 | Average |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| [tomoro-colqwen3-8b](https://huggingface.co/TomoroAI/tomoro-colqwen3-embed-8b) | 50.3 | 71.2 | 58.8 | 48.0 | 27.8 | 51.2 |
| **[tomoro-colqwen3-4b](https://huggingface.co/TomoroAI/tomoro-colqwen3-embed-4b)** | 51.1 | 72.3 | **59.5** | 49.0 | 26.6 | **51.7** |
| **[IFM-TTE-7B](https://interestfm-tte.github.io/)** | 52.7 | **73.1** | 49.7 | **51.5** | **31.6** | **51.7** |
| [seed-1.6-embedding](https://seed1-6-embedding.github.io/) | **55.3** | 71.3 | 56.7 | 48.8 | 24.6 | 51.3 |
`IFM-TTE-7B` and `seed-1.6-embedding` utilize video-text fine-tuning, whereas the Tomoro ColQwen series relies solely on image-text data.
---
## 💻 Usage
The processor exposes `process_texts`, `process_images`, and `score_multi_vector`.
### Prerequisites
We strongly recommend installing `flash-attn`. If it is not available, load the model with `attn_implementation="sdpa"` instead (see the fallback sketch after the install commands below).
We currently support `torch==2.8.0` only; for newer PyTorch versions, please build FlashAttention manually, otherwise throughput may be low.
```bash
pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128
pip install transformers pillow requests
pip install flash-attn --no-build-isolation
```
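If `flash-attn` cannot be installed, a minimal fallback is to load the model with PyTorch's SDPA attention; throughput may be lower, but outputs are equivalent. A sketch:
```python
import torch
from transformers import AutoModel

# Fallback loading without flash-attn: use SDPA attention instead of FlashAttention 2.
model = AutoModel.from_pretrained(
    "TomoroAI/tomoro-colqwen3-embed-8b",
    dtype=torch.bfloat16,
    attn_implementation="sdpa",
    trust_remote_code=True,
    device_map="cuda" if torch.cuda.is_available() else "cpu",
).eval()
```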
### Inference Code
```python
import torch
from transformers import AutoModel, AutoProcessor
from PIL import Image, UnidentifiedImageError
import requests
from io import BytesIO
# Configuration
MODEL_ID = "TomoroAI/tomoro-colqwen3-embed-8b"
DTYPE = torch.bfloat16
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
# Load Model & Processor
processor = AutoProcessor.from_pretrained(
MODEL_ID,
trust_remote_code=True,
max_num_visual_tokens=1280,
)
model = AutoModel.from_pretrained(
MODEL_ID,
dtype=DTYPE,
attn_implementation="flash_attention_2",
trust_remote_code=True,
device_map=DEVICE,
).eval()
# Sample Data
queries = [
"Retrieve the city of Singapore",
"Retrieve the city of Beijing",
"Retrieve the city of London",
]
docs = [
"https://upload.wikimedia.org/wikipedia/commons/2/27/Singapore_skyline_2022.jpg",
"https://upload.wikimedia.org/wikipedia/commons/6/61/Beijing_skyline_at_night.JPG",
"https://upload.wikimedia.org/wikipedia/commons/4/49/London_skyline.jpg",
]
def load_image(url: str) -> Image.Image:
# Some CDNs (e.g., Wikimedia) expect a browser-like UA to avoid 403s.
for headers in ({}, {"User-Agent": "Mozilla/5.0 (compatible; ColQwen3-demo/1.0)"}):
resp = requests.get(url, headers=headers, timeout=10)
if resp.status_code == 403:
continue
resp.raise_for_status()
try:
return Image.open(BytesIO(resp.content)).convert("RGB")
except UnidentifiedImageError as e:
raise RuntimeError(f"Failed to decode image from {url}") from e
raise RuntimeError(f"Could not fetch image (HTTP 403) from {url}; try downloading locally and loading from file path.")
# Helper Functions
def encode_queries(texts, batch_size=8):
outputs = []
for start in range(0, len(texts), batch_size):
batch = processor.process_texts(texts=texts[start : start + batch_size])
batch = {k: v.to(DEVICE) for k, v in batch.items()}
with torch.inference_mode():
out = model(**batch)
vecs = out.embeddings.to(torch.bfloat16).cpu()
outputs.extend(vecs)
return outputs
def encode_docs(urls, batch_size=4):
pil_images = [load_image(url) for url in urls]
outputs = []
for start in range(0, len(pil_images), batch_size):
batch_imgs = pil_images[start : start + batch_size]
features = processor.process_images(images=batch_imgs)
features = {k: v.to(DEVICE) if isinstance(v, torch.Tensor) else v for k, v in features.items()}
with torch.inference_mode():
out = model(**features)
vecs = out.embeddings.to(torch.bfloat16).cpu()
outputs.extend(vecs)
return outputs
# Execution
query_embeddings = encode_queries(queries)
doc_embeddings = encode_docs(docs)
# MaxSim Scoring
scores = processor.score_multi_vector(query_embeddings, doc_embeddings)
print(scores)
```
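`scores` is a `(num_queries, num_docs)` tensor of MaxSim scores. As a short follow-up, one way to rank documents per query (the loop below is illustrative, not part of the API):
```python
# Rank documents for each query by MaxSim score (highest first).
ranking = scores.argsort(dim=-1, descending=True)
for qi, query in enumerate(queries):
    best = ranking[qi, 0].item()
    print(f"{query!r} -> best match: {docs[best]} (score={scores[qi, best].item():.2f})")
```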
### 🎞️ Lightweight Video Retrieval
ColQwen3 generalizes to short videos despite being trained only on image-text retrieval. This minimal example lets the processor decode and sample each clip into frames, encodes queries and videos into multi-vector embeddings, and scores them with MaxSim.
We recommend a maximum of 5,120 visual tokens per video for best retrieval performance.
```python
from pathlib import Path
import torch
from transformers import AutoModel, AutoProcessor
MODEL_ID = "TomoroAI/tomoro-colqwen3-embed-8b"
DTYPE = torch.bfloat16
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained(
MODEL_ID,
trust_remote_code=True,
max_num_visual_tokens=5120,
)
model = AutoModel.from_pretrained(
MODEL_ID,
dtype=DTYPE,
attn_implementation="flash_attention_2",
trust_remote_code=True,
device_map=DEVICE,
).eval()
queries = ["Retrieve the football video", "Find the basketball clip", "Find the swimming clip", "Find the wrestling clip"]
videos = ["/root/sample_videos/football.mp4", "/root/sample_videos/basketball.mp4", "/root/sample_videos/swimming.mp4", "/root/sample_videos/wrestling.mp4"]
def encode_queries(texts):
batch = processor.process_texts(texts=texts)
batch = {k: v.to(DEVICE) for k, v in batch.items()}
with torch.inference_mode():
out = model(**batch)
return out.embeddings.to(torch.bfloat16).cpu()
def encode_videos(paths):
vids = [str(Path(p).expanduser()) for p in paths]
feats = processor(
videos=vids,
padding="longest",
return_tensors=None, # keep metadata as Python objects until we drop it
videos_kwargs={"return_metadata": True},
)
feats.pop("video_metadata", None) # drop metadata before forwarding to the model
feats = feats.convert_to_tensors(tensor_type="pt")
feats = {k: v.to(DEVICE) if isinstance(v, torch.Tensor) else v for k, v in feats.items()}
with torch.inference_mode():
out = model(**feats)
return out.embeddings.to(torch.bfloat16).cpu()
q_emb = encode_queries(queries)
v_emb = encode_videos(videos)
scores = processor.score_multi_vector(q_emb, v_emb)
print(scores)
```
---
## ⚖️ Strengths & Limitations
### Strengths
* **Performance:** State-of-the-art retrieval performance on the ViDoRe V2 & V3 benchmarks, with excellent results on multimodal document retrieval.
* **Complex Layouts:** Excellent handling of chart-rich PDFs and domain-specific documents.
* **End-to-end Retrieval:** Capable of OCR-free retrieval on unseen multimodal documents, without requiring an intermediate vision LLM to generate summaries for retrieval.
* **Retrieval Task Transfer:** Inherits strong text retrieval performance from the merged Qwen3-Embedding-8B weights.
* **Multilingualism:** Strong performance on non-English document inputs.
### Limitations
* **Video Support:** The model generalizes to video retrieval in our preliminary findings, but it has not been fine-tuned on large-scale video retrieval datasets; we plan to improve this in the future.
* **Storage Cost:** Still larger than single-vector baselines despite the smaller embedding dimension.
* **Retrieval Instructions:** The model is not yet fine-tuned with diverse system instructions in the way the Qwen3-Embedding models are; we intend to improve this with more synthetic data in the future.
### License & Data
Distributed under **Apache 2.0**.
* **Weights:** Upstream Qwen checkpoints retain their community licenses; ensure compliance when mixing.
* **Data:** Training data includes ViDoRe/MTEB corpora and synthetic VisRAG assets.
### Acknowledgement
We gratefully acknowledge the support of **[Tomoro AI](https://tomoro.ai/)**, a leading AI engineering firm dedicated to delivering high-quality enterprise solutions that accelerate complex R&D and business transformation. This work is directly applied to enhance Tomoro’s customized multimodal agentic RAG pipelines, empowering the autonomous agents to parse, reason over, and retrieve from large-scale enterprise **internal documentation**. By bridging the gap between vision and language, this model supports Tomoro AI's mission to **accelerate the delivery of high-quality** enterprise multimodal solutions and deploy robust, production-grade intelligence across high-stakes industries.
## 📚 Citation
If you use this model, please cite:
```bibtex
@misc{huang2025beyond,
author = {Huang, Xin and Tan, Kye Min},
title = {Beyond Text: Unlocking True Multimodal, End-to-end RAG with Tomoro ColQwen3},
year = {2025},
url = {https://tomoro.ai/insights/beyond-text-unlocking-true-multimodal-end-to-end-rag-with-tomoro-colqwen3},
publisher = {Tomoro.ai}
}
```