# NanoVDR
## Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval
NanoVDR distills a frozen 2B VLM teacher (Qwen3-VL-Embedding-2B) into tiny text-only query encoders (69–151M parameters) for visual document retrieval. Documents are indexed offline by the teacher; queries are encoded on CPU in 51 ms via a DistilBERT forward pass — no vision model at query time.
Queries and documents both map to the same 2048-dim single vector inherited from the teacher's embedding space, so retrieval is a plain dot product — FAISS-compatible with no MaxSim pooling. The doc index stores just 4 KB per page (float16), making NanoVDR 64× more storage-efficient than multi-vector retrievers like ColPali.
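The storage and scoring claims above are easy to verify in a few lines. The sketch below uses random NumPy arrays in place of real teacher/student embeddings; only the shapes and dtypes match NanoVDR's setup (2048-dim float16 page vectors, single-vector dot-product scoring).

```python
import numpy as np

dim = 2048
rng = np.random.default_rng(0)

# Stand-ins for teacher-indexed pages and a student-encoded query.
doc_embeddings = rng.standard_normal((1000, dim)).astype(np.float16)
query_emb = rng.standard_normal((1, dim)).astype(np.float16)

# Retrieval is a plain dot product -- no MaxSim pooling over patch vectors.
scores = query_emb.astype(np.float32) @ doc_embeddings.astype(np.float32).T
top5 = np.argsort(-scores[0])[:5]

# One float16 vector per page: 2048 dims x 2 bytes = 4096 bytes = 4 KB.
print(doc_embeddings[0].nbytes)  # 4096
```

The 4 KB figure falls straight out of the index layout: a single 2048-dimensional float16 vector per page is 2048 × 2 bytes, versus the hundreds of patch vectors per page a multi-vector retriever like ColPali must keep.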
## Models
| Model | Backbone | Params | ViDoRe v1 | ViDoRe v2 | ViDoRe v3 | Retention | CPU latency |
|---|---|---|---|---|---|---|---|
| NanoVDR-S-Multi ⭐ | DistilBERT | 69M | 82.2 | 61.9 | 46.5 | 95.1% | 51 ms |
| NanoVDR-S | DistilBERT | 69M | 82.2 | 60.5 | 43.5 | 92.4% | 51 ms |
| NanoVDR-M | BERT-base | 112M | 82.1 | 62.2 | 44.7 | 94.0% | 101 ms |
| NanoVDR-L | ModernBERT | 151M | 82.4 | 61.5 | 44.2 | 93.4% | 109 ms |
NDCG@5 (×100) on the ViDoRe benchmark (22 datasets). Retention = Student / Teacher. Teacher = Qwen3-VL-Embedding-2B (2.0B).
## Quick Start
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nanovdr/NanoVDR-S-Multi")

query_emb = model.encode(["What was the revenue growth in Q3 2024?"])  # (1, 2048)

# Retrieve via dot product against teacher-indexed document embeddings:
# scores = query_emb @ doc_embeddings.T
```
Documents must be indexed offline with the teacher VLM. See the NanoVDR-S-Multi model page for a complete guide.