---
license: apache-2.0
license_name: apache-2.0
license_link: https://www.apache.org/licenses/LICENSE-2.0
tags:
- text
- image
- video
- multimodal-embedding
- vidore
- colpali
- colqwen3
- multilingual-embedding
language:
- multilingual
library_name: transformers
pipeline_tag: visual-document-retrieval
base_model:
- Qwen/Qwen3-VL-8B-Instruct
---
# TomoroAI/tomoro-colqwen3-embed-8b
## ⚡ Executive Summary
**TomoroAI/tomoro-colqwen3-embed-8b** is a state-of-the-art [ColPali](https://arxiv.org/abs/2407.01449)-style multimodal embedding model. It maps text queries, visual documents (images, PDFs) or short videos into aligned multi-vector embeddings.
Built by merging **[Qwen/Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct)** with **[Qwen/Qwen3-Embedding-8B](https://huggingface.co/Qwen/Qwen3-Embedding-8B)**, this model inherits robust text retrieval capabilities while preserving a full vision stack. It has been fine-tuned on a curated mixture of [VDR](https://huggingface.co/datasets/vdr-multilingual-train), [ViDoRe-ColPali-Training](https://huggingface.co/datasets/vidore/colpali_train_set), [VisRAG-Ret-Train-Synthetic-data](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Train-Synthetic-data), and [VisRAG-Ret-Train-In-domain-data](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Train-In-domain-data). It achieves SOTA or competitive performance across **ViDoRe V1-V3** (English and Multilingual) while offering a significantly reduced embedding footprint compared to other full-dimension ColPali-style alternatives.
## 🛠️ Model Specifications
| Feature | Detail |
| :--- | :--- |
| **Architecture** | Qwen3-VL 8B (Encoder-only variant) + 320-dim Projection Head |
| **Methodology** | ColPali-style Late Interaction (MaxSim scoring) |
| **Token Budget** | Up to 1,280 visual tokens per page or 5,120 visual tokens per video (text prompts constrained only by the base context window) |
| **Context Window** | 32k (inherited from base), typical usage < 2k tokens |
| **Output** | Multi-vector (Seq_Len × 320), L2-normalized |
| **Supported Modalities** | Text Queries, RGB Images, Synthetic Documents, Short Video (Frame-wise) |
| **Precision** | `bfloat16` weights, FlashAttention 2 enabled |
### Key Properties
* **Merged Encoders:** Combines the Qwen3-VL vision encoder (patch-grid tokens with spatial merge) and language encoder.
* **Projection:** A custom 320-dim head projects every token (text or visual) into a vector.
* **Processing:**
* **Queries:** Left-padded text sequences.
* **Documents:** Rendered with a lightweight vision prompt and flattened into image tokens.
* **Video:** Supports video retrieval by decoding videos into frames and processing via the vision stack (generalization capability, not explicitly fine-tuned; dedicated benchmark coming soon).
* **Storage Efficiency:**
* *Baseline (NVIDIA Nemo-3B):* Stores 1,802 tokens @ 3,072 dims (≈10.3 TB for 1M images).
* *Tomoro ColQwen3:* Stores max 1,280 tokens @ 320 dims (**≈0.82 TB for 1M images**).
* **Result:** **13× smaller footprint** with higher performance.
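For clarity, the late-interaction (MaxSim) scoring referenced above can be sketched in a few lines. This is an illustrative helper only (`maxsim_score` is our name, not part of the released API; use `processor.score_multi_vector` in practice), assuming L2-normalized multi-vector embeddings as produced by this model:
```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """query_emb: (num_query_tokens, 320); doc_emb: (num_doc_tokens, 320), both L2-normalized."""
    sim = query_emb @ doc_emb.T            # cosine similarity between every query/document token pair
    return sim.max(dim=-1).values.sum()    # best-matching document token per query token, summed

# Illustrative footprint arithmetic (bfloat16 = 2 bytes per value):
# 1,280 tokens x 320 dims x 2 bytes ≈ 0.82 MB per page → ≈ 0.82 TB for 1M pages.
```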
---
## 📊 Evaluation Results
We report results on the **ViDoRe** benchmark suite. The model sets new standards on the multilingual and English splits of ViDoRe V2 and V3 while maintaining comparably high performance on ViDoRe V1.
### ViDoRe V3 (Latest)
**English nDCG@5**
| Model | CompSci | Energy | FinanceEn | FinanceFr | HR | Industrial | Pharma | Physics | **Avg** |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **[tomoro-colqwen3-8b](https://huggingface.co/TomoroAI/tomoro-colqwen3-embed-8b)** | 0.7443 | **0.6491** | **0.6823** | **0.4546** | **0.6421** | 0.5766 | **0.6665** | **0.4747** | **0.6113** |
| [tomoro-colqwen3-4b](https://huggingface.co/TomoroAI/tomoro-colqwen3-embed-4b) | 0.7419 | 0.6023 | 0.6753 | 0.4202 | 0.6037 | **0.5787** | 0.6612 | 0.4640 | 0.5934 |
| [nemo-colembed-3b](https://huggingface.co/nvidia/llama-nemoretriever-colembed-3b-v1) | 0.7514 | 0.5838 | 0.6712 | 0.3730 | 0.6256 | 0.5447 | 0.6524 | 0.4128 | 0.5769 |
| [jinaai/jina-embeddings-v4](https://huggingface.co/jinaai/jina-embeddings-v4) | 0.7175 | 0.5842 | 0.6417 | 0.3859 | 0.6206 | 0.5443 | 0.6303 | 0.4191 | 0.5680 |
| [nomic-ai/colnomic-embed-multimodal-7b](https://huggingface.co/nomic-ai/colnomic-embed-multimodal-7b) | **0.7528** | 0.5824 | 0.6041 | 0.3877 | 0.6060 | 0.5229 | 0.6226 | 0.4423 | 0.5651 |
**Multilingual nDCG@5** (Excluding English Subsets)
| Model | CompSci | Energy | FinanceEn | FinanceFr | HR | Industrial | Pharma | Physics | **Avg** |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **[tomoro-colqwen3-8b](https://huggingface.co/TomoroAI/tomoro-colqwen3-embed-8b)** | 0.7194 | **0.6619** | **0.6172** | **0.4570** | **0.6097** | **0.5164** | **0.6403** | **0.4706** | **0.5866** |
| [tomoro-colqwen3-4b](https://huggingface.co/TomoroAI/tomoro-colqwen3-embed-4b) | 0.7213 | 0.6374 | 0.6019 | 0.4305 | 0.5637 | 0.5131 | 0.6351 | 0.4636 | 0.5708 |
| [nemo-colembed-3b](https://huggingface.co/nvidia/llama-nemoretriever-colembed-3b-v1) | 0.7216 | 0.5901 | 0.5646 | 0.4102 | 0.5504 | 0.4335 | 0.6170 | 0.4192 | 0.5383 |
| [jinaai/jina-embeddings-v4](https://huggingface.co/jinaai/jina-embeddings-v4) | 0.6843 | 0.6036 | 0.5482 | 0.4249 | 0.5542 | 0.4732 | 0.6059 | 0.4381 | 0.5416 |
| [nomic-ai/colnomic-embed-multimodal-7b](https://huggingface.co/nomic-ai/colnomic-embed-multimodal-7b) | **0.7333** | 0.6160 | 0.5219 | 0.4169 | 0.5494 | 0.4764 | 0.5938 | 0.4449 | 0.5441 |
### ViDoRe V2
**English nDCG@5**
| Model | BioMed | ESG HL | ESG Rpts | Economics | **Avg** |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **[tomoro-colqwen3-8b](https://huggingface.co/TomoroAI/tomoro-colqwen3-embed-8b)** | **0.6784** | **0.7598** | **0.6549** | 0.6159 | **0.6772** |
| [tomoro-colqwen3-4b](https://huggingface.co/TomoroAI/tomoro-colqwen3-embed-4b) | 0.6718 | 0.7465 | 0.6300 | 0.5910 | 0.6598 |
| [nemo-colembed-3b](https://huggingface.co/nvidia/llama-nemoretriever-colembed-3b-v1) | 0.6518 | 0.7538 | 0.6030 | **0.6619** | 0.6676 |
| [jinaai/jina-embeddings-v4](https://huggingface.co/jinaai/jina-embeddings-v4) | 0.6359 | 0.6512 | 0.5194 | 0.5955 | 0.6005 |
| [nomic-ai/colnomic-embed-multimodal-7b](https://huggingface.co/nomic-ai/colnomic-embed-multimodal-7b) | 0.6479 | 0.6871 | 0.5498 | 0.5955 | 0.6201 |
**Multilingual nDCG@5**
| Model | BioMed | ESG Rpts | Economics | **Avg** |
| :--- | :--- | :--- | :--- | :--- |
| **[tomoro-colqwen3-8b](https://huggingface.co/TomoroAI/tomoro-colqwen3-embed-8b)** | 0.6467 | 0.5911 | **0.5875** | **0.6085** |
| [tomoro-colqwen3-4b](https://huggingface.co/TomoroAI/tomoro-colqwen3-embed-4b) | **0.6478** | **0.6226** | 0.5536 | 0.6080 |
| [nemo-colembed-3b](https://huggingface.co/nvidia/llama-nemoretriever-colembed-3b-v1) | 0.6187 | 0.5640 | 0.5506 | 0.5778 |
| [jinaai/jina-embeddings-v4](https://huggingface.co/jinaai/jina-embeddings-v4) | 0.5994 | 0.5178 | 0.5364 | 0.5512 |
| [nomic-ai/colnomic-embed-multimodal-7b](https://huggingface.co/nomic-ai/colnomic-embed-multimodal-7b) | 0.6224 | 0.5336 | 0.5433 | 0.5664 |
### ViDoRe V1 (English nDCG@5)
| Model | ArxivQA | DocVQA | InfoVQA | Shift | Syn-AI | Syn-Eng | Syn-Gov | Syn-Health | TabFQuAD | Tatdqa | **Avg** |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **[tomoro-colqwen3-8b](https://huggingface.co/TomoroAI/tomoro-colqwen3-embed-8b)** | **0.9115** | **0.6637** | 0.9448 | 0.8789 | 0.9926 | 0.9671 | **0.9758** | 0.9906 | 0.9423 | 0.8092 | 0.9076 |
| [tomoro-colqwen3-4b](https://huggingface.co/TomoroAI/tomoro-colqwen3-embed-4b) | 0.9066 | 0.6624 | 0.9429 | 0.8739 | 0.9926 | 0.9691 | 0.9717 | **0.9963** | 0.9433 | 0.7983 | 0.9057 |
| [nemo-colembed-3b](https://huggingface.co/nvidia/llama-nemoretriever-colembed-3b-v1) | 0.8835 | 0.6621 | **0.9492** | 0.9070 | **0.9963** | 0.9663 | 0.9782 | 0.9926 | 0.9594 | 0.8057 | **0.9100** |
| [jinaai/jina-embeddings-v4](https://huggingface.co/jinaai/jina-embeddings-v4) | 0.8846 | 0.6014 | 0.9379 | **0.9293** | 0.9926 | **0.9726** | 0.9659 | 0.9913 | 0.9560 | 0.8035 | 0.9035 |
| [nomic-ai/colnomic-embed-multimodal-7b](https://huggingface.co/nomic-ai/colnomic-embed-multimodal-7b) | 0.8832 | 0.6011 | 0.9221 | 0.8930 | 0.9876 | 0.9626 | 0.9592 | 0.9926 | **0.9596** | **0.8108** | 0.8972 |
### Video Retrieval Evaluation
To demonstrate that Tomoro ColQwen3 generalizes strongly to video retrieval, we evaluated the models on the [CareBench](https://carebench.github.io/) text-to-video (General Retrieval) task and the [MMEB-V2](https://huggingface.co/datasets/TIGER-Lab/MMEB-V2) video_ret benchmark.
#### CareBench Evaluation
For this evaluation, we utilized a **raw video encoding** approach: our models encoded the video files directly without any additional textual annotations or metadata inputs. This highlights the model's ability to perform retrieval based purely on visual semantics.
| Model | Recall@1 | Recall@5 | Recall@10 |
| :--- | :--- | :--- | :--- |
| **[tomoro-colqwen3-8b](https://huggingface.co/TomoroAI/tomoro-colqwen3-embed-8b)** | **0.8670** | **0.9590** | 0.9850 |
| [tomoro-colqwen3-4b](https://huggingface.co/TomoroAI/tomoro-colqwen3-embed-4b) | 0.8620 | 0.9570 | 0.9800 |
| [Care7B](https://huggingface.co/MCG-NJU/CaRe-7B) | 0.7700 | 0.9560 | **0.9870** |
#### MMEB-V2 video_ret Evaluation
All evaluations below use the Hit@1 metric.
| Model | MSR-VTT | MSVD | DiDeMo | VATEX | YouCook2 | Average |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| [tomoro-colqwen3-8b](https://huggingface.co/TomoroAI/tomoro-colqwen3-embed-8b) | 50.3 | 71.2 | 58.8 | 48.0 | 27.8 | 51.2 |
| **[tomoro-colqwen3-4b](https://huggingface.co/TomoroAI/tomoro-colqwen3-embed-4b)** | 51.1 | 72.3 | **59.5** | 49.0 | 26.6 | **51.7** |
| **[IFM-TTE-7B](https://interestfm-tte.github.io/)** | 52.7 | **73.1** | 49.7 | **51.5** | **31.6** | **51.7** |
| [seed-1.6-embedding](https://seed1-6-embedding.github.io/) | **55.3** | 71.3 | 56.7 | 48.8 | 24.6 | 51.3 |
`IFM-TTE-7B` and `seed-1.6-embedding` utilize video-text fine-tuning, whereas the Tomoro ColQwen series relies solely on image-text data.
---
## 💻 Usage
The processor exposes `process_texts`, `process_images`, and `score_multi_vector`.
### Prerequisites
We strongly recommend installing `flash-attn`. If it is not available, load the model with `attn_implementation="sdpa"` instead (see the fallback sketch after the install commands below).
We currently support `torch==2.8.0` only; for newer PyTorch versions, please build FlashAttention manually, otherwise throughput may be low.
```bash
pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128
pip install transformers pillow requests
pip install flash-attn --no-build-isolation
```
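If `flash-attn` cannot be installed, a minimal fallback is to load the model with PyTorch's SDPA attention; throughput may be lower, but outputs are equivalent. A sketch:
```python
import torch
from transformers import AutoModel

# Fallback loading without flash-attn: use SDPA attention instead of FlashAttention 2.
model = AutoModel.from_pretrained(
    "TomoroAI/tomoro-colqwen3-embed-8b",
    dtype=torch.bfloat16,
    attn_implementation="sdpa",
    trust_remote_code=True,
    device_map="cuda" if torch.cuda.is_available() else "cpu",
).eval()
```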
### Inference Code
```python
import torch
from transformers import AutoModel, AutoProcessor
from PIL import Image, UnidentifiedImageError
import requests
from io import BytesIO
# Configuration
MODEL_ID = "TomoroAI/tomoro-colqwen3-embed-8b"
DTYPE = torch.bfloat16
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
# Load Model & Processor
processor = AutoProcessor.from_pretrained(
MODEL_ID,
trust_remote_code=True,
max_num_visual_tokens=1280,
)
model = AutoModel.from_pretrained(
MODEL_ID,
dtype=DTYPE,
attn_implementation="flash_attention_2",
trust_remote_code=True,
device_map=DEVICE,
).eval()
# Sample Data
queries = [
"Retrieve the city of Singapore",
"Retrieve the city of Beijing",
"Retrieve the city of London",
]
docs = [
"https://upload.wikimedia.org/wikipedia/commons/2/27/Singapore_skyline_2022.jpg",
"https://upload.wikimedia.org/wikipedia/commons/6/61/Beijing_skyline_at_night.JPG",
"https://upload.wikimedia.org/wikipedia/commons/4/49/London_skyline.jpg",
]
def load_image(url: str) -> Image.Image:
# Some CDNs (e.g., Wikimedia) expect a browser-like UA to avoid 403s.
for headers in ({}, {"User-Agent": "Mozilla/5.0 (compatible; ColQwen3-demo/1.0)"}):
resp = requests.get(url, headers=headers, timeout=10)
if resp.status_code == 403:
continue
resp.raise_for_status()
try:
return Image.open(BytesIO(resp.content)).convert("RGB")
except UnidentifiedImageError as e:
raise RuntimeError(f"Failed to decode image from {url}") from e
raise RuntimeError(f"Could not fetch image (HTTP 403) from {url}; try downloading locally and loading from file path.")
# Helper Functions
def encode_queries(texts, batch_size=8):
outputs = []
for start in range(0, len(texts), batch_size):
batch = processor.process_texts(texts=texts[start : start + batch_size])
batch = {k: v.to(DEVICE) for k, v in batch.items()}
with torch.inference_mode():
out = model(**batch)
vecs = out.embeddings.to(torch.bfloat16).cpu()
outputs.extend(vecs)
return outputs
def encode_docs(urls, batch_size=4):
pil_images = [load_image(url) for url in urls]
outputs = []
for start in range(0, len(pil_images), batch_size):
batch_imgs = pil_images[start : start + batch_size]
features = processor.process_images(images=batch_imgs)
features = {k: v.to(DEVICE) if isinstance(v, torch.Tensor) else v for k, v in features.items()}
with torch.inference_mode():
out = model(**features)
vecs = out.embeddings.to(torch.bfloat16).cpu()
outputs.extend(vecs)
return outputs
# Execution
query_embeddings = encode_queries(queries)
doc_embeddings = encode_docs(docs)
# MaxSim Scoring
scores = processor.score_multi_vector(query_embeddings, doc_embeddings)
print(scores)
```
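`scores` is a `(num_queries, num_docs)` tensor of MaxSim scores. As a short follow-up, one way to rank documents per query (the loop below is illustrative, not part of the API):
```python
# Rank documents for each query by MaxSim score (highest first).
ranking = scores.argsort(dim=-1, descending=True)
for qi, query in enumerate(queries):
    best = ranking[qi, 0].item()
    print(f"{query!r} -> best match: {docs[best]} (score={scores[qi, best].item():.2f})")
```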
### 🎞️ Lightweight Video Retrieval
ColQwen3 generalizes to short videos despite being trained only on image-text retrieval. This minimal example lets the processor decode and sample each clip into frames, encodes queries and videos into multi-vector embeddings, and scores them with MaxSim.
We recommend a maximum of 5,120 visual tokens per video for best retrieval performance.
```python
from pathlib import Path
import torch
from transformers import AutoModel, AutoProcessor
MODEL_ID = "TomoroAI/tomoro-colqwen3-embed-8b"
DTYPE = torch.bfloat16
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained(
MODEL_ID,
trust_remote_code=True,
max_num_visual_tokens=5120,
)
model = AutoModel.from_pretrained(
MODEL_ID,
dtype=DTYPE,
attn_implementation="flash_attention_2",
trust_remote_code=True,
device_map=DEVICE,
).eval()
queries = ["Retrieve the football video", "Find the basketball clip", "Find the swimming clip", "Find the wrestling clip"]
videos = ["/root/sample_videos/football.mp4", "/root/sample_videos/basketball.mp4", "/root/sample_videos/swimming.mp4", "/root/sample_videos/wrestling.mp4"]
def encode_queries(texts):
batch = processor.process_texts(texts=texts)
batch = {k: v.to(DEVICE) for k, v in batch.items()}
with torch.inference_mode():
out = model(**batch)
return out.embeddings.to(torch.bfloat16).cpu()
def encode_videos(paths):
vids = [str(Path(p).expanduser()) for p in paths]
feats = processor(
videos=vids,
padding="longest",
return_tensors=None, # keep metadata as Python objects until we drop it
videos_kwargs={"return_metadata": True},
)
feats.pop("video_metadata", None) # drop metadata before forwarding to the model
feats = feats.convert_to_tensors(tensor_type="pt")
feats = {k: v.to(DEVICE) if isinstance(v, torch.Tensor) else v for k, v in feats.items()}
with torch.inference_mode():
out = model(**feats)
return out.embeddings.to(torch.bfloat16).cpu()
q_emb = encode_queries(queries)
v_emb = encode_videos(videos)
scores = processor.score_multi_vector(q_emb, v_emb)
print(scores)
```
---
## ⚖️ Strengths & Limitations
### Strengths
* **Performance:** State-of-the-art retrieval performance on the ViDoRe V2 & V3 benchmarks, with excellent results on multimodal document retrieval.
* **Complex Layouts:** Excellent handling of chart-rich PDFs and domain-specific documents.
* **End-to-end Retrieval:** Capable of OCR-free retrieval on unseen multimodal documents, without requiring an intermediate vision LLM to generate summaries for retrieval.
* **Retrieval Task Transfer:** Inherits strong text retrieval performance from the merged Qwen3-Embedding-8B weights.
* **Multilingualism:** Strong performance on non-English document inputs.
### Limitations
* **Video Support:** The model generalizes to video retrieval in our preliminary findings, but it has not been fine-tuned on large-scale video retrieval datasets; we plan to improve this in the future.
* **Storage Cost:** Still larger than single-vector baselines despite the smaller embedding dimension.
* **Retrieval Instructions:** The model is not yet fine-tuned with diverse system instructions in the way the Qwen3-Embedding models are; we intend to improve this with more synthetic data in the future.
### License & Data
Distributed under **Apache 2.0**.
* **Weights:** Upstream Qwen checkpoints retain their community licenses; ensure compliance when mixing.
* **Data:** Training data includes ViDoRe/MTEB corpora and synthetic VisRAG assets.
### Acknowledgement
We gratefully acknowledge the support of **[Tomoro AI](https://tomoro.ai/)**, a leading AI engineering firm dedicated to delivering high-quality enterprise solutions that accelerate complex R&D and business transformation. This work is directly applied to enhance Tomoro’s customized multimodal agentic RAG pipelines, empowering the autonomous agents to parse, reason over, and retrieve from large-scale enterprise **internal documentation**. By bridging the gap between vision and language, this model supports Tomoro AI's mission to **accelerate the delivery of high-quality** enterprise multimodal solutions and deploy robust, production-grade intelligence across high-stakes industries.
## 📚 Citation
If you use this model, please cite:
```bibtex
@misc{huang2025beyond,
author = {Huang, Xin and Tan, Kye Min},
title = {Beyond Text: Unlocking True Multimodal, End-to-end RAG with Tomoro ColQwen3},
year = {2025},
url = {https://tomoro.ai/insights/beyond-text-unlocking-true-multimodal-end-to-end-rag-with-tomoro-colqwen3},
publisher = {Tomoro.ai}
}
```