ColMate: Contrastive Late Interaction and Masked Text for Multimodal Document Retrieval

Paper Link: https://aclanthology.org/2025.emnlp-industry.145/

The abstract of the paper states:

Retrieval-augmented generation has proven practical when models require specialized knowledge or access to the latest data. However, existing methods for multimodal document retrieval often replicate techniques developed for text-only retrieval, whether in how they encode documents, define training objectives, or compute similarity scores. To address these limitations, we present ColMate, a document retrieval model that bridges the gap between multimodal representation learning and document retrieval. ColMate utilizes a novel OCR-based pretraining objective, a self-supervised masked contrastive learning objective, and a late interaction scoring mechanism more relevant to multimodal document structures and visual characteristics. ColMate obtains 3.61% improvements over existing retrieval models on the ViDoRe V2 benchmark, demonstrating stronger generalization to out-of-domain benchmarks.
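
The late interaction scoring mechanism mentioned in the abstract follows the ColBERT-style "MaxSim" formulation used by the ColPali family of retrievers. The snippet below is a minimal, self-contained sketch of that scoring rule on dummy embeddings; it illustrates the idea only and is not ColMate's exact implementation.

import torch
import torch.nn.functional as F

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late interaction (MaxSim) between one query and one document.

    query_emb: (n_query_tokens, dim) L2-normalized query token embeddings
    doc_emb:   (n_doc_tokens, dim)   L2-normalized document patch embeddings
    """
    sim = query_emb @ doc_emb.T  # cosine similarities, (n_query_tokens, n_doc_tokens)
    # Each query token keeps its best-matching document patch; per-token maxima are summed
    return sim.max(dim=-1).values.sum()

# Toy example with random embeddings (illustration only)
query = F.normalize(torch.randn(16, 128), dim=-1)
doc = F.normalize(torch.randn(1024, 128), dim=-1)
print(maxsim_score(query, doc))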

Usage

Install colpali-engine:

pip install "colpali-engine>=0.3.0,<0.4.0"

Then run the following code:

import torch
from PIL import Image

from colpali_engine.models import ColPali, ColPaliProcessor

model_name = "ahmed-masry/ColMate-3B"

model = ColPali.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",  # or "mps" if on Apple Silicon
).eval()

processor = ColPaliProcessor.from_pretrained(model_name)

# Your inputs
images = [
    Image.new("RGB", (32, 32), color="white"),
    Image.new("RGB", (16, 16), color="black"),
]
queries = [
    "Is attention really all you need?",
    "Are Benjamin, Antoine, Merve, and Jo best friends?",
]

# Process the inputs
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

# Forward pass
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

scores = processor.score_multi_vector(query_embeddings, image_embeddings)
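
The scores returned by score_multi_vector form a (num_queries, num_images) similarity matrix, where higher values indicate greater relevance. As a small illustrative follow-up, assuming that shape convention, you can rank the documents per query like this:

# Index of the best-matching image for each query
best_image_per_query = scores.argmax(dim=1)
print(best_image_per_query)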

License

We release this model under the Gemma license, inherited from the base PaliGemma model.

Acknowledgement

We appreciate the well-documented training and evaluation GitHub repositories provided by the ColPali team, which were essential to our model development. This model card is adapted from the ColPali model card.

Contact

If you have any questions about this work, feel free to reach out to Ahmed Masry at masry20@yorku.ca or ahmed.elmasry24653@gmail.com.

Citation

If you plan to use ColMate in your research, please consider citing us as follows:

@inproceedings{masry-etal-2025-colmate,
    title = "{C}ol{M}ate: Contrastive Late Interaction and Masked Text for Multimodal Document Retrieval",
    author = "Masry, Ahmed  and
      Thakkar, Megh  and
      Bechard, Patrice  and
      Madhusudhan, Sathwik Tejaswi  and
      Awal, Rabiul  and
      Mishra, Shambhavi  and
      Suresh, Akshay Kalkunte  and
      Daruru, Srivatsava  and
      Hoque, Enamul  and
      Gella, Spandana  and
      Scholak, Torsten  and
      Rajeswar, Sai",
    editor = "Potdar, Saloni  and
      Rojas-Barahona, Lina  and
      Montella, Sebastien",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track",
    month = nov,
    year = "2025",
    address = "Suzhou (China)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-industry.145/",
    doi = "10.18653/v1/2025.emnlp-industry.145",
    pages = "2071--2080",
    ISBN = "979-8-89176-333-3",
    abstract = "Retrieval-augmented generation has proven practical when models require specialized knowledge or access to the latest data. However, existing methods for multimodal document retrieval often replicate techniques developed for text-only retrieval, whether in how they encode documents, define training objectives, or compute similarity scores. To address these limitations, we present ColMate, a document retrieval model that bridges the gap between multimodal representation learning and document retrieval. ColMate utilizes a novel OCR-based pretraining objective, a self-supervised masked contrastive learning objective, and a late interaction scoring mechanism more relevant to multimodal document structures and visual characteristics. ColMate obtains 3.61{\%} improvements over existing retrieval models on the ViDoRe V2 benchmark, demonstrating stronger generalization to out-of-domain benchmarks."
}