NanoVDR

Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval

Paper  |  GitHub  |  Dataset


NanoVDR distills a frozen 2B VLM teacher (Qwen3-VL-Embedding-2B) into tiny text-only query encoders (69–151M parameters) for visual document retrieval. Documents are indexed offline by the teacher; queries are encoded on CPU in 51 ms via a DistilBERT forward pass — no vision model at query time.

Queries and documents both map to the same 2048-dim single vector inherited from the teacher's embedding space, so retrieval is a plain dot product — FAISS-compatible with no MaxSim pooling. The doc index stores just 4 KB per page (float16), making NanoVDR 64× more storage-efficient than multi-vector retrievers like ColPali.
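The storage figure and the scoring rule follow directly from the vector shape: a 2048-dim float16 vector occupies exactly 4 KB, and ranking is a single dense matrix product. A minimal NumPy sketch, with random vectors standing in for real teacher/student embeddings:

```python
import numpy as np

num_pages, dim = 1000, 2048
rng = np.random.default_rng(0)

# Offline: teacher-encoded page embeddings, stored as float16.
doc_embeddings = rng.standard_normal((num_pages, dim)).astype(np.float16)
assert doc_embeddings[0].nbytes == 4096  # 2048 dims x 2 bytes = 4 KB per page

# Online: a query embedding from the student encoder (placeholder vector here).
query_emb = rng.standard_normal((1, dim)).astype(np.float32)

# Retrieval is one dot product per page -- no MaxSim pooling.
scores = query_emb @ doc_embeddings.astype(np.float32).T  # shape (1, num_pages)
top5 = np.argsort(-scores[0])[:5]
```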

Models

| Model | Backbone | Params | ViDoRe v1 | ViDoRe v2 | ViDoRe v3 | Retention | CPU Latency |
|---|---|---|---|---|---|---|---|
| NanoVDR-S-Multi | DistilBERT | 69M | 82.2 | 61.9 | 46.5 | 95.1% | 51 ms |
| NanoVDR-S | DistilBERT | 69M | 82.2 | 60.5 | 43.5 | 92.4% | 51 ms |
| NanoVDR-M | BERT-base | 112M | 82.1 | 62.2 | 44.7 | 94.0% | 101 ms |
| NanoVDR-L | ModernBERT | 151M | 82.4 | 61.5 | 44.2 | 93.4% | 109 ms |

NDCG@5 (×100) on the ViDoRe benchmark (22 datasets). Retention = Student / Teacher. Teacher = Qwen3-VL-Embedding-2B (2.0B).

Quick Start

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nanovdr/NanoVDR-S-Multi")
query_emb = model.encode(["What was the revenue growth in Q3 2024?"])  # (1, 2048)

# Retrieve via dot product against teacher-indexed document embeddings
# scores = query_emb @ doc_embeddings.T
```

Documents must be indexed offline with the teacher VLM. See the NanoVDR-S-Multi model page for a complete guide.
