# NanoVDR
## Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval
NanoVDR distills a frozen 2B VLM teacher (Qwen3-VL-Embedding-2B) into tiny text-only query encoders (69–151M parameters) for visual document retrieval. Documents are indexed offline by the teacher; queries are encoded on CPU in 51 ms via a DistilBERT forward pass — no vision model at query time.
Queries and documents both map to the same 2048-dim single vector inherited from the teacher's embedding space, so retrieval is a plain dot product — FAISS-compatible with no MaxSim pooling. The doc index stores just 4 KB per page (float16), making NanoVDR 64× more storage-efficient than multi-vector retrievers like ColPali.
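The storage and scoring claims above are easy to verify in a few lines. The sketch below uses random NumPy arrays in place of real teacher/student embeddings; only the shapes and dtypes match NanoVDR's setup (2048-dim float16 page vectors, single-vector dot-product scoring).

```python
import numpy as np

dim = 2048
rng = np.random.default_rng(0)

# Stand-ins for teacher-indexed pages and a student-encoded query.
doc_embeddings = rng.standard_normal((1000, dim)).astype(np.float16)
query_emb = rng.standard_normal((1, dim)).astype(np.float16)

# Retrieval is a plain dot product -- no MaxSim pooling over patch vectors.
scores = query_emb.astype(np.float32) @ doc_embeddings.astype(np.float32).T
top5 = np.argsort(-scores[0])[:5]

# One float16 vector per page: 2048 dims x 2 bytes = 4096 bytes = 4 KB.
print(doc_embeddings[0].nbytes)  # 4096
```

The 4 KB figure falls straight out of the index layout: a single 2048-dimensional float16 vector per page is 2048 × 2 bytes, versus the hundreds of patch vectors per page a multi-vector retriever like ColPali must keep.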
## Models
| Model | Backbone | Params | ViDoRe v1 | ViDoRe v2 | ViDoRe v3 | Retention | CPU latency |
|---|---|---|---|---|---|---|---|
| NanoVDR-S-Multi ⭐ | DistilBERT | 69M | 82.2 | 61.9 | 46.5 | 95.1% | 51 ms |
| NanoVDR-S | DistilBERT | 69M | 82.2 | 60.5 | 43.5 | 92.4% | 51 ms |
| NanoVDR-M | BERT-base | 112M | 82.1 | 62.2 | 44.7 | 94.0% | 101 ms |
| NanoVDR-L | ModernBERT | 151M | 82.4 | 61.5 | 44.2 | 93.4% | 109 ms |
NDCG@5 (×100) on the ViDoRe benchmark (22 datasets). Retention = Student / Teacher. Teacher = Qwen3-VL-Embedding-2B (2.0B).
## Quick Start
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nanovdr/NanoVDR-S-Multi")

query_emb = model.encode(["What was the revenue growth in Q3 2024?"])  # (1, 2048)

# Retrieve via dot product against teacher-indexed document embeddings:
# scores = query_emb @ doc_embeddings.T
```
Documents must be indexed offline with the teacher VLM. See the NanoVDR-S-Multi model page for a complete guide.