NanoVDR: A 70M Text-Only Model That Retrieves Visual Documents as Well as a 2B VLM
Paper: NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval
What if the best way to search through visual documents... is to not look at them at all?
Visual Document Retrieval (VDR) has made impressive progress recently. Models like ColPali (3B), DSE-Qwen2 (2B), and Tomoro-8B (8B) can search through document images—PDFs, financial reports, scientific papers—using natural language queries. But there's a catch: these models use the same multi-billion parameter Vision-Language Model (VLM) to encode both documents and queries. That means even a simple text query like "What was Q3 revenue?" needs to pass through a 2–8B parameter VLM. On a single CPU thread, that's 2.5 to 8 seconds per query.
We asked ourselves: does a text query really need a vision model?
The Key Insight: Queries Don't Have Eyes
Here's the fundamental asymmetry we exploit: documents are visual, but queries are just text.
A document page can contain charts, tables, diagrams, equations, multi-column layouts—it genuinely needs a powerful vision model to understand. But the query? It's a short text string. There is no visual information in "Show me the revenue breakdown by segment". So why are we running it through a 2 billion parameter vision model?
NanoVDR takes this asymmetry seriously. We keep the heavy VLM teacher (Qwen3-VL-Embedding-2B) for offline document indexing—it runs once per document, on a GPU, and we store the results. But for online query encoding, we distill the teacher's knowledge into a 69M parameter DistilBERT that runs on CPU in 51 milliseconds.
The student never sees a single image during training or inference. It learns to map text queries into the teacher's visual embedding space using only text. And it works remarkably well.
How It Works: Simpler Than You'd Think
The training pipeline is almost embarrassingly simple:
- Pre-cache teacher embeddings: Run the frozen VLM teacher on all training queries (text mode—no images needed) to get target embeddings.
- Train the student: A DistilBERT backbone + mean pooling + MLP projector learns to produce embeddings that are close to the teacher's, via cosine similarity loss.
- Done. Total training cost: under 13 GPU-hours.
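The pipeline above can be sketched in a few lines of PyTorch. This is a minimal illustration rather than the released training code: the class name `NanoVDRStudent`, the projector's layer sizes, and the loss implementation are assumptions based on the description (DistilBERT backbone + mean pooling + MLP projector, cosine similarity loss against pre-cached teacher embeddings).

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class NanoVDRStudent(nn.Module):
    """Illustrative student: DistilBERT + mean pooling + MLP projector."""

    def __init__(self, backbone="distilbert-base-uncased", out_dim=2048):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(backbone)
        hidden = self.encoder.config.hidden_size
        # Projector maps the backbone's hidden size into the teacher's 2048-dim space
        self.projector = nn.Sequential(
            nn.Linear(hidden, hidden), nn.GELU(), nn.Linear(hidden, out_dim)
        )

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (h * mask).sum(1) / mask.sum(1)  # mean pooling over non-pad tokens
        return nn.functional.normalize(self.projector(pooled), dim=-1)

def alignment_loss(student_emb, teacher_emb):
    # Cosine-distance to the frozen teacher's pre-cached query embeddings
    return (1 - nn.functional.cosine_similarity(student_emb, teacher_emb, dim=-1)).mean()
```

Because the teacher embeddings are computed once and cached, each training step only runs the small student forward pass plus this loss.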
At inference, documents are pre-indexed by the teacher (one-time offline cost), and queries go through the 69M student on CPU. Retrieval is a plain dot product—fully FAISS-compatible, no MaxSim pooling needed.
Each document page is stored as a single 2048-dim vector (4 KB in float16). Compare that to multi-vector models like ColPali that store ~1,030 token vectors per page: NanoVDR's index is 64x more storage-efficient.
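Since each page is a single vector and scoring is a plain dot product, retrieval reduces to one matrix multiply (or a flat inner-product index such as `faiss.IndexFlatIP`). A minimal NumPy sketch, with the per-page storage arithmetic from above:

```python
import numpy as np

# Storage math: one 2048-dim vector per page, 2 bytes each in float16
per_page_bytes = 2048 * 2  # 4096 bytes = 4 KB per page

# Dot-product retrieval over L2-normalized embeddings; a NumPy matmul is the
# same computation a FAISS IndexFlatIP would perform for small corpora.
def search(query_emb, doc_embs, k=5):
    scores = doc_embs @ query_emb          # dot product = cosine on unit vectors
    top = np.argsort(-scores)[:k]
    return top, scores[top]
```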
The Surprising Finding: Just Align, Don't Rank
Here's where it gets interesting. We ran an ablation sweep: 6 loss functions × 3 text-encoder backbones × 22 datasets (ViDoRe v1–v3), all trained to convergence.
The loss functions span the full spectrum:
- Pure alignment (cosine distance between student and teacher query embeddings)
- Pure ranking (KL-divergence over teacher's in-batch similarity distributions)
- Various combinations of the two
- InfoNCE (standard contrastive learning with hard labels)
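The two endpoints of this spectrum can be sketched as follows. This is a minimal illustration of the objective families, not the paper's exact formulation: the temperature value and the use of in-batch query-query similarities as the KL target are assumptions.

```python
import torch
import torch.nn.functional as F

def align_loss(student_q, teacher_q):
    # Pure alignment: pull each student query onto the teacher's coordinates
    return (1 - F.cosine_similarity(student_q, teacher_q, dim=-1)).mean()

def rank_loss(student_q, teacher_q, temperature=0.05):
    # Pure ranking: match the teacher's in-batch similarity distribution via KL
    s_sim = F.normalize(student_q, dim=-1) @ F.normalize(teacher_q, dim=-1).T
    t_sim = F.normalize(teacher_q, dim=-1) @ F.normalize(teacher_q, dim=-1).T
    return F.kl_div(
        F.log_softmax(s_sim / temperature, dim=-1),
        F.softmax(t_sim / temperature, dim=-1),
        reduction="batchmean",
    )

def combined_loss(student_q, teacher_q, alpha=0.5):
    # Interpolating alpha traces the align-dominant / rank-dominant middle ground
    return alpha * align_loss(student_q, teacher_q) + (1 - alpha) * rank_loss(student_q, teacher_q)
```

Note the asymmetry in supervision: `align_loss` constrains the absolute position of every embedding, while `rank_loss` only constrains relative similarities within a batch.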
The result? A perfectly monotonic trend: the more alignment you add and the less ranking you use, the better the performance. Pure alignment wins on every backbone, every benchmark.
| Loss | ViDoRe v1 (10) | ViDoRe v2 (4) | ViDoRe v3 (8) |
|---|---|---|---|
| Align-only | 82.2 | 61.4 | 44.1 |
| Align-dominant | 81.6 | 59.8 | 42.8 |
| Combined | 81.5 | 59.1 | 42.5 |
| Rank-dominant | 81.5 | 58.6 | 42.1 |
| Rank-only | 81.1 | 57.4 | 41.6 |
| InfoNCE | 71.5 | 39.8 | 30.0 |
NDCG@5 (×100), averaged over 3 backbones (DistilBERT, BERT-base, ModernBERT). All trained identically for 20 epochs.
This is surprising because KL-divergence ranking is the standard approach in retrieval distillation (TAS-B, MarginMSE, etc.). But in our setting, the teacher's embedding space is so well-structured that directly aligning to its coordinates is more informative than mimicking its ranking distributions.
Even more surprising: the InfoNCE baseline—standard contrastive learning with hard one-hot labels—collapses, losing 10–22 points compared to alignment. The teacher's "dark knowledge" (the continuous geometry of its embedding space) is the critical ingredient. Hard labels throw it away.
And there's a practical bonus: alignment-only training needs only teacher query embeddings (text-encoded). No teacher document embeddings, no negative sampling, no corpus-level processing. This cuts training cost roughly in half.
The Real Bottleneck: It's Not Vision, It's Language
We were curious: if a text-only model can retrieve visual documents so well, what's the actual bottleneck?
We analyzed retention (student NDCG@5 / teacher NDCG@5) across all 19,537 evaluation queries grouped by language:
| Language | Training Data | Retention |
|---|---|---|
| English | 68.7% | 94.3% |
| French | 7.6% | 92.1% |
| Italian | 7.6% | 90.0% |
| Spanish | 8.1% | 89.7% |
| German | 8.0% | 85.7% |
| Portuguese | 0.0% | 75.6% |
The pattern is clear: retention tracks training data coverage, not document visual complexity. English queries retain 94.3% of teacher quality across all document types—charts, tables, diagrams, dense text. Portuguese queries, completely absent from training, retain only 75.6% on the exact same documents.
The bottleneck isn't that DistilBERT can't "see"—it's that DistilBERT can't speak Portuguese.
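The retention metric is a simple per-language ratio. A minimal sketch of the grouping, assuming retention is computed as the ratio of summed (equivalently, mean) NDCG@5 scores per language; the paper's exact aggregation may differ:

```python
from collections import defaultdict

def retention_by_language(results):
    """results: iterable of (language, student_ndcg5, teacher_ndcg5), one per query."""
    sums = defaultdict(lambda: [0.0, 0.0])
    for lang, student, teacher in results:
        sums[lang][0] += student
        sums[lang][1] += teacher
    # Retention = mean student NDCG@5 / mean teacher NDCG@5 within each language
    return {lang: s / t for lang, (s, t) in sums.items() if t > 0}
```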
Closing the Gap with Translated Queries
Since alignment training is purely query-centric (the student only learns from text queries, never from document images), fixing the language gap is trivially cheap:
- Translate ~489K English training queries into 5 languages using Helsinki-NLP Opus-MT models
- Re-encode translated queries with the frozen teacher (text mode—no images)
- Retrain the student on 1.49M pairs (original + translations)
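The translation step can be sketched with the `transformers` pipeline API. The Opus-MT checkpoint names below are real public Hugging Face models, but the exact checkpoints (including the one used for Portuguese) and batching used for NanoVDR are assumptions:

```python
# Hedged sketch of the query-translation step; checkpoint choice and batch
# size are illustrative, not the paper's exact configuration.
OPUS_MT = {
    "fr": "Helsinki-NLP/opus-mt-en-fr",
    "de": "Helsinki-NLP/opus-mt-en-de",
    "es": "Helsinki-NLP/opus-mt-en-es",
    "it": "Helsinki-NLP/opus-mt-en-it",
}

def batched(items, size):
    # Fixed-size batching so the translation pipeline sees manageable chunks
    for i in range(0, len(items), size):
        yield items[i:i + size]

def translate_queries(queries, lang, batch_size=64):
    from transformers import pipeline  # lazy import: only needed at translation time
    translator = pipeline("translation", model=OPUS_MT[lang])
    translated = []
    for batch in batched(queries, batch_size):
        translated.extend(t["translation_text"] for t in translator(list(batch)))
    return translated
```

The translated queries are then fed to the frozen teacher exactly like the originals, so no document re-indexing is required.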
The results are exactly as predicted:
| Language | Before | After | Change |
|---|---|---|---|
| English | 60.3 | 60.3 | +0.0 |
| French | 51.7 | 53.2 | +1.5 |
| German | 42.3 | 45.4 | +3.1 |
| Portuguese | 36.8 | 46.1 | +9.3 |
English is untouched. Portuguese gains +9.3 NDCG@5. After augmentation, every language achieves >92% retention. The maximum cross-lingual gap narrows from 18.6 pp to just 2.7 pp.
How NanoVDR Stacks Up
Here are the full results on the ViDoRe benchmark (22 datasets across 3 ViDoRe versions, NDCG@5):
| Model | Params | Scoring | v1 (10) | v2 (4) | v3 (8) |
|---|---|---|---|---|---|
| Tomoro-8B | 8.0B | MaxSim | 90.6 | 65.0 | 59.0 |
| ColNomic-7B | 7.0B | MaxSim | 89.8 | 60.4 | 55.9 |
| ColPali | 3.0B | MaxSim | 84.2 | 54.7 | 42.0 |
| DSE-Qwen2 | 2.0B | Cosine | 85.1 | 55.7 | 41.3 |
| Teacher (Qwen3-VL) | 2.0B | Cosine | 84.3 | 65.3 | 50.0 |
| NanoVDR-S-Multi | 69M | Cosine | 82.2 | 61.9 | 46.5 |
| NanoVDR-M | 112M | Cosine | 82.1 | 62.2 | 44.7 |
| NanoVDR-L | 151M | Cosine | 82.4 | 61.5 | 44.2 |
NanoVDR-S-Multi (69M) outperforms DSE-Qwen2 (2B) on v2 (+6.2) and v3 (+5.2) with 32× fewer parameters. It even beats ColPali (3B) on v2 (+7.2) and v3 (+4.5)—a multi-vector model with MaxSim scoring.
And the efficiency numbers are dramatic:
| Metric | NanoVDR-S | DSE-Qwen2 | ColPali |
|---|---|---|---|
| Params | 69M | 2,209M (32×) | 2,964M (43×) |
| Query latency (CPU) | 51 ms | 2,539 ms (50×) | 7,284 ms (143×) |
| Model size | 274 MB | 8.8 GB | 11.9 GB |
| Index/1M pages | 8.2 GB | 6.1 GB | 264 GB |
Data Efficiency: Less Is (Almost) Enough
We trained NanoVDR-S on random subsets of our 711K training pairs (without translation augmentation) to measure data efficiency:
At just 25% of the full training data (178K pairs), NanoVDR-S already reaches 93% teacher retention on v1. The alignment objective is highly data-efficient: it directly learns the teacher's continuous embedding geometry rather than having to discover ranking boundaries from discrete labels.
Try It Yourself
NanoVDR is fully open-source and easy to use:
```python
from sentence_transformers import SentenceTransformer

# Load the 69M student (works on CPU!)
model = SentenceTransformer("nanovdr/NanoVDR-S-Multi")

query_emb = model.encode(["What was the revenue growth in Q3 2024?"])  # (1, 2048)

# Retrieve via cosine similarity against pre-indexed document embeddings:
# scores = query_emb @ doc_embeddings.T
```
- Interactive Demo — try queries against 1,360 real computer science document pages
- Models — NanoVDR-S, S-Multi, M, L
What This Means for the Community
NanoVDR demonstrates something we think is broadly useful: you don't always need a symmetric architecture for asymmetric VDR problems.
VDR is inherently asymmetric—documents are visual, queries are text. By embracing this asymmetry rather than fighting it, we get:
- 50–143× faster query encoding (CPU, single-thread)
- 64× less index storage (single vector vs. multi-vector)
- 95.1% teacher retention with 29× fewer parameters
- Under 13 GPU-hours total training cost
The approach might generalize beyond VDR. Any retrieval task where queries and documents live in fundamentally different modalities could benefit from asymmetric distillation—audio search, video retrieval, cross-lingual IR with lightweight query encoders. The key ingredient is a strong teacher with a well-structured embedding space.
We'd love to hear from the community—especially if you try NanoVDR in production or adapt the asymmetric distillation framework to other tasks. Feel free to reach out on the model page.
Acknowledgements
This project has received funding from the Business Finland co-innovation programme under grant agreement No. 69/31/2025. It is supported by the AiWo: Human-centric AI-enabled Collaborative Fieldwork Operations project (2025–2027), which aims to revolutionize fieldwork operations and enhance human-AI collaboration across the manufacturing, construction, and industrial design sectors. The calculations presented in this project were performed using computer resources within the Aalto University School of Science "Science-IT" project.



