NanoVDR: A 70M Text-Only Model That Retrieves Visual Documents as Well as a 2B VLM
Paper: NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval
What if the best way to search through visual documents... is to not look at them at all?
Visual Document Retrieval (VDR) has made impressive progress recently. Models like ColPali (3B), DSE-Qwen2 (2B), and Tomoro-8B (8B) can search through document images—PDFs, financial reports, scientific papers—using natural language queries. But there's a catch: these models use the same multi-billion parameter Vision-Language Model (VLM) to encode both documents and queries. That means even a simple text query like "What was Q3 revenue?" needs to pass through a 2–8B parameter VLM. On a single CPU thread, that's 2.5 to 8 seconds per query.
We asked ourselves: does a text query really need a vision model?
The Key Insight: Queries Don't Have Eyes
Here's the fundamental asymmetry we exploit: documents are visual, but queries are just text.
A document page can contain charts, tables, diagrams, equations, multi-column layouts—it genuinely needs a powerful vision model to understand. But the query? It's a short text string. There is no visual information in "Show me the revenue breakdown by segment". So why are we running it through a 2 billion parameter vision model?
NanoVDR takes this asymmetry seriously. We keep the heavy VLM teacher (Qwen3-VL-Embedding-2B) for offline document indexing—it runs once per document, on a GPU, and we store the results. But for online query encoding, we distill the teacher's knowledge into a 69M parameter DistilBERT that runs on CPU in 51 milliseconds.
The student never sees a single image during training or inference. It learns to map text queries into the teacher's visual embedding space using only text. And it works remarkably well.
How It Works: Simpler Than You'd Think
The training pipeline is almost embarrassingly simple:
- Pre-cache teacher embeddings: Run the frozen VLM teacher on all training queries (text mode—no images needed) to get target embeddings.
- Train the student: A DistilBERT backbone + mean pooling + MLP projector learns to produce embeddings that are close to the teacher's, via cosine similarity loss.
- Done. Total training cost: under 13 GPU-hours.
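The pipeline above can be sketched in a few lines of PyTorch. This is a minimal illustration rather than the released training code: the class name `NanoVDRStudent`, the projector's layer sizes, and the loss implementation are assumptions based on the description (DistilBERT backbone + mean pooling + MLP projector, cosine similarity loss against pre-cached teacher embeddings).

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class NanoVDRStudent(nn.Module):
    """Illustrative student: DistilBERT + mean pooling + MLP projector."""

    def __init__(self, backbone="distilbert-base-uncased", out_dim=2048):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(backbone)
        hidden = self.encoder.config.hidden_size
        # Projector maps the backbone's hidden size into the teacher's 2048-dim space
        self.projector = nn.Sequential(
            nn.Linear(hidden, hidden), nn.GELU(), nn.Linear(hidden, out_dim)
        )

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (h * mask).sum(1) / mask.sum(1)  # mean pooling over non-pad tokens
        return nn.functional.normalize(self.projector(pooled), dim=-1)

def alignment_loss(student_emb, teacher_emb):
    # Cosine-distance to the frozen teacher's pre-cached query embeddings
    return (1 - nn.functional.cosine_similarity(student_emb, teacher_emb, dim=-1)).mean()
```

Because the teacher embeddings are computed once and cached, each training step only runs the small student forward pass plus this loss.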
At inference, documents are pre-indexed by the teacher (one-time offline cost), and queries go through the 69M student on CPU. Retrieval is a plain dot product—fully FAISS-compatible, no MaxSim pooling needed.
Each document page is stored as a single 2048-dim vector (4 KB in float16). Compare that to multi-vector models like ColPali that store ~1,030 token vectors per page: NanoVDR's index is 64x more storage-efficient.
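Since each page is a single vector and scoring is a plain dot product, retrieval reduces to one matrix multiply (or a flat inner-product index such as `faiss.IndexFlatIP`). A minimal NumPy sketch, with the per-page storage arithmetic from above:

```python
import numpy as np

# Storage math: one 2048-dim vector per page, 2 bytes each in float16
per_page_bytes = 2048 * 2  # 4096 bytes = 4 KB per page

# Dot-product retrieval over L2-normalized embeddings; a NumPy matmul is the
# same computation a FAISS IndexFlatIP would perform for small corpora.
def search(query_emb, doc_embs, k=5):
    scores = doc_embs @ query_emb          # dot product = cosine on unit vectors
    top = np.argsort(-scores)[:k]
    return top, scores[top]
```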
The Surprising Finding: Just Align, Don't Rank
Here's where it gets interesting. We ran an ablation sweep: 6 loss functions × 3 text-encoder backbones × 22 datasets (ViDoRe v1–v3), all trained to convergence.
The loss functions span the full spectrum:
- Pure alignment (cosine distance between student and teacher query embeddings)
- Pure ranking (KL-divergence over teacher's in-batch similarity distributions)
- Various combinations of the two
- InfoNCE (standard contrastive learning with hard labels)
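The two endpoints of this spectrum can be sketched as follows. This is a minimal illustration of the objective families, not the paper's exact formulation: the temperature value and the use of in-batch query-query similarities as the KL target are assumptions.

```python
import torch
import torch.nn.functional as F

def align_loss(student_q, teacher_q):
    # Pure alignment: pull each student query onto the teacher's coordinates
    return (1 - F.cosine_similarity(student_q, teacher_q, dim=-1)).mean()

def rank_loss(student_q, teacher_q, temperature=0.05):
    # Pure ranking: match the teacher's in-batch similarity distribution via KL
    s_sim = F.normalize(student_q, dim=-1) @ F.normalize(teacher_q, dim=-1).T
    t_sim = F.normalize(teacher_q, dim=-1) @ F.normalize(teacher_q, dim=-1).T
    return F.kl_div(
        F.log_softmax(s_sim / temperature, dim=-1),
        F.softmax(t_sim / temperature, dim=-1),
        reduction="batchmean",
    )

def combined_loss(student_q, teacher_q, alpha=0.5):
    # Interpolating alpha traces the align-dominant / rank-dominant middle ground
    return alpha * align_loss(student_q, teacher_q) + (1 - alpha) * rank_loss(student_q, teacher_q)
```

Note the asymmetry in supervision: `align_loss` constrains the absolute position of every embedding, while `rank_loss` only constrains relative similarities within a batch.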
The result? A perfectly monotonic trend: the more alignment you add and the less ranking you use, the better the performance. Pure alignment wins on every backbone, every benchmark.
| Loss | ViDoRe v1 (10) | ViDoRe v2 (4) | ViDoRe v3 (8) |
|---|---|---|---|
| Align-only | 82.2 | 61.4 | 44.1 |
| Align-dominant | 81.6 | 59.8 | 42.8 |
| Combined | 81.5 | 59.1 | 42.5 |
| Rank-dominant | 81.5 | 58.6 | 42.1 |
| Rank-only | 81.1 | 57.4 | 41.6 |
| InfoNCE | 71.5 | 39.8 | 30.0 |
NDCG@5 (×100), averaged over 3 backbones (DistilBERT, BERT-base, ModernBERT). All trained identically for 20 epochs.
This is surprising because KL-divergence ranking is the standard approach in retrieval distillation (TAS-B, MarginMSE, etc.). But in our setting, the teacher's embedding space is so well-structured that directly aligning to its coordinates is more informative than mimicking its ranking distributions.
Even more surprising: the InfoNCE baseline—standard contrastive learning with hard one-hot labels—collapses, losing 10–22 points compared to alignment. The teacher's "dark knowledge" (the continuous geometry of its embedding space) is the critical ingredient. Hard labels throw it away.
And there's a practical bonus: alignment-only training needs only teacher query embeddings (text-encoded). No teacher document embeddings, no negative sampling, no corpus-level processing. This cuts training cost roughly in half.
The Real Bottleneck: It's Not Vision, It's Language
We were curious: if a text-only model can retrieve visual documents so well, what's the actual bottleneck?
We analyzed retention (student NDCG@5 / teacher NDCG@5) across all 19,537 evaluation queries grouped by language:
| Language | Training Data | Retention |
|---|---|---|
| English | 68.7% | 94.3% |
| French | 7.6% | 92.1% |
| Italian | 7.6% | 90.0% |
| Spanish | 8.1% | 89.7% |
| German | 8.0% | 85.7% |
| Portuguese | 0.0% | 75.6% |
The pattern is clear: retention tracks training data coverage, not document visual complexity. English queries retain 94.3% of teacher quality across all document types—charts, tables, diagrams, dense text. Portuguese queries, completely absent from training, retain only 75.6% on the exact same documents.
The bottleneck isn't that DistilBERT can't "see"—it's that DistilBERT can't speak Portuguese.
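The retention metric is a simple per-language ratio. A minimal sketch of the grouping, assuming retention is computed as the ratio of summed (equivalently, mean) NDCG@5 scores per language; the paper's exact aggregation may differ:

```python
from collections import defaultdict

def retention_by_language(results):
    """results: iterable of (language, student_ndcg5, teacher_ndcg5), one per query."""
    sums = defaultdict(lambda: [0.0, 0.0])
    for lang, student, teacher in results:
        sums[lang][0] += student
        sums[lang][1] += teacher
    # Retention = mean student NDCG@5 / mean teacher NDCG@5 within each language
    return {lang: s / t for lang, (s, t) in sums.items() if t > 0}
```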
Closing the Gap with Translated Queries
Since alignment training is purely query-centric (the student only learns from text queries, never from document images), fixing the language gap is trivially cheap:
- Translate ~489K English training queries into 5 languages using Helsinki-NLP Opus-MT models
- Re-encode translated queries with the frozen teacher (text mode—no images)
- Retrain the student on 1.49M pairs (original + translations)
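The translation step can be sketched with the `transformers` pipeline API. The Opus-MT checkpoint names below are real public Hugging Face models, but the exact checkpoints (including the one used for Portuguese) and batching used for NanoVDR are assumptions:

```python
# Hedged sketch of the query-translation step; checkpoint choice and batch
# size are illustrative, not the paper's exact configuration.
OPUS_MT = {
    "fr": "Helsinki-NLP/opus-mt-en-fr",
    "de": "Helsinki-NLP/opus-mt-en-de",
    "es": "Helsinki-NLP/opus-mt-en-es",
    "it": "Helsinki-NLP/opus-mt-en-it",
}

def batched(items, size):
    # Fixed-size batching so the translation pipeline sees manageable chunks
    for i in range(0, len(items), size):
        yield items[i:i + size]

def translate_queries(queries, lang, batch_size=64):
    from transformers import pipeline  # lazy import: only needed at translation time
    translator = pipeline("translation", model=OPUS_MT[lang])
    translated = []
    for batch in batched(queries, batch_size):
        translated.extend(t["translation_text"] for t in translator(list(batch)))
    return translated
```

The translated queries are then fed to the frozen teacher exactly like the originals, so no document re-indexing is required.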
The results are exactly as predicted:
| Language | Before | After | Change |
|---|---|---|---|
| English | 60.3 | 60.3 | +0.0 |
| French | 51.7 | 53.2 | +1.5 |
| German | 42.3 | 45.4 | +3.1 |
| Portuguese | 36.8 | 46.1 | +9.3 |
English is untouched. Portuguese gains +9.3 NDCG@5. After augmentation, every language achieves >92% retention. The maximum cross-lingual gap narrows from 18.6 pp to just 2.7 pp.
How NanoVDR Stacks Up
Here are the full results on the ViDoRe benchmark (22 datasets across 3 ViDoRe versions, NDCG@5):
| Model | Params | Scoring | v1 (10) | v2 (4) | v3 (8) |
|---|---|---|---|---|---|
| Tomoro-8B | 8.0B | MaxSim | 90.6 | 65.0 | 59.0 |
| ColNomic-7B | 7.0B | MaxSim | 89.8 | 60.4 | 55.9 |
| ColPali | 3.0B | MaxSim | 84.2 | 54.7 | 42.0 |
| DSE-Qwen2 | 2.0B | Cosine | 85.1 | 55.7 | 41.3 |
| Teacher (Qwen3-VL) | 2.0B | Cosine | 84.3 | 65.3 | 50.0 |
| NanoVDR-S-Multi | 69M | Cosine | 82.2 | 61.9 | 46.5 |
| NanoVDR-M | 112M | Cosine | 82.1 | 62.2 | 44.7 |
| NanoVDR-L | 151M | Cosine | 82.4 | 61.5 | 44.2 |
NanoVDR-S-Multi (69M) outperforms DSE-Qwen2 (2B) on v2 (+6.2) and v3 (+5.2) with 32× fewer parameters. It even beats ColPali (3B) on v2 (+7.2) and v3 (+4.5)—a multi-vector model with MaxSim scoring.
And the efficiency numbers are dramatic:
| Metric | NanoVDR-S | DSE-Qwen2 | ColPali |
|---|---|---|---|
| Params | 69M | 2,209M (32×) | 2,964M (43×) |
| Query latency (CPU) | 51 ms | 2,539 ms (50×) | 7,284 ms (143×) |
| Model size | 274 MB | 8.8 GB | 11.9 GB |
| Index/1M pages | 8.2 GB | 6.1 GB | 264 GB |
Data Efficiency: Less Is (Almost) Enough
We trained NanoVDR-S on random subsets of our 711K training pairs (without translation augmentation) to measure data efficiency:
At just 25% of the full training data (178K pairs), NanoVDR-S already reaches 93% teacher retention on v1. The alignment objective is highly data-efficient: it directly learns the teacher's continuous embedding geometry rather than having to discover ranking boundaries from discrete labels.
Try It Yourself
NanoVDR is fully open-source and easy to use:
```python
from sentence_transformers import SentenceTransformer

# Load the 69M student (works on CPU!)
model = SentenceTransformer("nanovdr/NanoVDR-S-Multi")

query_emb = model.encode(["What was the revenue growth in Q3 2024?"])  # (1, 2048)

# Retrieve via cosine similarity against pre-indexed document embeddings:
# scores = query_emb @ doc_embeddings.T
```
- Interactive Demo — try queries against 1,360 real computer science document pages
- Models — NanoVDR-S, S-Multi, M, L
What This Means for the Community
NanoVDR demonstrates something we think is broadly useful: you don't always need a symmetric architecture for asymmetric VDR problems.
VDR is inherently asymmetric—documents are visual, queries are text. By embracing this asymmetry rather than fighting it, we get:
- 50–143× faster query encoding (CPU, single-thread)
- 64× less index storage (single vector vs. multi-vector)
- 95.1% teacher retention with 29× fewer parameters
- Under 13 GPU-hours total training cost
The approach might generalize beyond VDR. Any retrieval task where queries and documents live in fundamentally different modalities could benefit from asymmetric distillation—audio search, video retrieval, cross-lingual IR with lightweight query encoders. The key ingredient is a strong teacher with a well-structured embedding space.
We'd love to hear from the community—especially if you try NanoVDR in production or adapt the asymmetric distillation framework to other tasks. Feel free to reach out on the model page.
Acknowledgements
This project has received funding from the Business Finland co-innovation programme under grant agreement No. 69/31/2025. It is supported by the AiWo: Human-centric AI-enabled Collaborative Fieldwork Operations project (2025–2027), which aims to revolutionize fieldwork operations and enhance human-AI collaboration across the manufacturing, construction, and industrial design sectors. The calculations presented in this project were performed using computer resources within the Aalto University School of Science "Science-IT" project.



