🔍 Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval (CVPR 2026)
📖 Paper (arXiv) | 🌐 Homepage | 🐙 Code (GitHub) | 🤗 Dataset (OACIRR) | 🛜 Download Weights Now 👇
🔔 News
- 🔥 [2026-04-07]: The AdaFocal model checkpoints are officially released and ready to use!
- 🔥 [2026-04-03]: The full Training/Evaluation code is officially released on GitHub!
- 🔥 [2026-03-25]: The OACIRR Benchmark is officially released on HuggingFace!
- 🎉 [2026-02-21]: Our paper "Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval" has been accepted to CVPR 2026!
🤖 Model Description
- Architecture: ViT-G (EVA-CLIP) + BLIP-2 Q-Former + Context-Aware Attention Modulator (CAAM)
- Task: Fine-grained Composed Image Retrieval (CIR) with Instance-level Consistency
- Training Data: Exclusively trained on the OACIRR Union Dataset
⚙️ AdaFocal Framework
To address the core challenges of the OACIR task, we propose AdaFocal, a framework that dynamically modulates visual attention for instance-level retrieval. It augments a multimodal fusion backbone with a lightweight Context-Aware Attention Modulator (CAAM), which balances instance fidelity against compositional reasoning.
Specifically, AdaFocal employs a two-stage reasoning process: Contextual Perception and Adaptive Focus. It first perceives the query's compositional context to predict a modulation scalar (β). This learned signal then drives an Attention Activation Mechanism, which explicitly and adaptively intensifies the visual focus on the user-specified instance region (provided via bounding box) during multimodal feature fusion.
By re-weighting the attention distribution, AdaFocal fuses the anchored instance, the global visual scene, and the textual modification into a single representation, providing a flexible baseline for identity-preserving retrieval.
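One way to picture the Adaptive Focus step is as a bias on attention logits. The sketch below is illustrative only and is not the released CAAM code: it assumes the bounding box is rasterized into a patch-level mask and that β is added to the logits of in-box patches before the softmax. The function names (`bbox_to_patch_mask`, `modulated_attention`), the 16×16 patch grid, the 32 query rows, and the fixed β value are all hypothetical; in AdaFocal, β is predicted from the query context rather than set by hand.

```python
import numpy as np


def bbox_to_patch_mask(bbox, image_size, grid_size):
    """Return a flat boolean mask marking patches that overlap the box.

    bbox is (x0, y0, x1, y1) in pixels on a square image of side image_size,
    split into grid_size x grid_size patches (hypothetical layout).
    """
    x0, y0, x1, y1 = bbox
    patch = image_size / grid_size
    idx = np.arange(grid_size)
    # A patch overlaps the box if its pixel extent intersects the box extent.
    col_hit = ((idx + 1) * patch > x0) & (idx * patch < x1)
    row_hit = ((idx + 1) * patch > y0) & (idx * patch < y1)
    return np.outer(row_hit, col_hit).reshape(-1)


def modulated_attention(logits, mask, beta):
    """Softmax over patches after adding beta to the in-box logits.

    logits: (num_queries, num_patches) raw attention scores.
    mask:   (num_patches,) boolean, True inside the bounding box.
    beta:   scalar modulation strength (assumed additive, for illustration).
    """
    shifted = logits + beta * mask.astype(logits.dtype)
    shifted = shifted - shifted.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)


# Toy example: 32 query rows attending over a 16 x 16 patch grid.
rng = np.random.default_rng(0)
logits = rng.normal(size=(32, 16 * 16))
mask = bbox_to_patch_mask((56, 56, 168, 168), image_size=224, grid_size=16)

plain = modulated_attention(logits, mask, beta=0.0)
focused = modulated_attention(logits, mask, beta=2.0)
print("in-box attention mass, beta=0:", plain[:, mask].sum(axis=-1).mean())
print("in-box attention mass, beta=2:", focused[:, mask].sum(axis=-1).mean())
```

With β = 0 this reduces to ordinary softmax attention; a larger β moves more attention mass onto the boxed region while the remaining patches still contribute global context.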
🚀 How to Use
1. Download the AdaFocal Weights
You can download the checkpoints using Git LFS:
cd OACIR
git lfs install
git clone https://huggingface.co/HaHaJun1101/AdaFocal ./checkpoints
Alternatively, download them via the Hugging Face Python API:
from huggingface_hub import snapshot_download
snapshot_download(repo_id="HaHaJun1101/AdaFocal", local_dir="OACIR/checkpoints", repo_type="model")
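If only one variant is needed, a single file can be fetched instead of the whole repository. The following is a minimal sketch, assuming the checkpoint files sit at the root of the model repository under the names listed in the performance table; `fetch_checkpoint` is a hypothetical helper, not part of the official codebase.

```python
from pathlib import Path

REPO_ID = "HaHaJun1101/AdaFocal"


def fetch_checkpoint(filename="adafocal_scalar.pt", local_dir="OACIR/checkpoints"):
    """Return the local path of one checkpoint, downloading it only if missing.

    Assumes `filename` exists at the root of the model repo (an assumption
    based on the file names used elsewhere in this card).
    """
    target = Path(local_dir) / filename
    if target.is_file():
        # Already present locally, so no network access is needed.
        return target
    # Imported lazily so the helper can be used offline when the file is cached.
    from huggingface_hub import hf_hub_download

    downloaded = hf_hub_download(
        repo_id=REPO_ID,
        filename=filename,
        local_dir=local_dir,
        repo_type="model",
    )
    return Path(downloaded)
```

Calling `fetch_checkpoint("adafocal_vector.pt")` would fetch the vector variant under the same assumption.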
2. Run Evaluation via Official Codebase
Once the weights are downloaded, you can evaluate the models with the evaluate.sh script provided in our GitHub codebase. Open evaluate.sh and set the path to your downloaded weights:
# Inside evaluate.sh
DATASET="Fashion"
MODEL_NAME="oacir_adafocal"
MODEL_WEIGHT="./checkpoints/adafocal_scalar.pt" # or adafocal_vector.pt
Then execute the script:
bash evaluate.sh
🏆 Model Performance on OACIRR
We provide two variants of the AdaFocal weights. You can reproduce the following results with the provided evaluate.sh script.
| Model Variant | Component Type | RID@1 (Avg) | R@1 (Avg) | R@5 (Avg) | Overall Avg | Weights File |
|---|---|---|---|---|---|---|
| AdaFocal (Scalar β) | Default Configuration | 81.52 | 63.08 | 90.98 | 78.53 | adafocal_scalar.pt |
| AdaFocal (Vector β) | Vector Ablation | 81.99 | 63.06 | 91.35 | 78.80 | adafocal_vector.pt |
Detailed breakdowns across the 4 domains:
| Variant | Fashion (RID@1 / R@1) | Car (RID@1 / R@1) | Product (RID@1 / R@1) | Landmark (RID@1 / R@1) |
|---|---|---|---|---|
| Scalar β | 73.68 / 64.45 | 78.39 / 54.85 | 91.36 / 73.85 | 82.65 / 59.18 |
| Vector β | 75.71 / 65.97 | 77.97 / 54.35 | 91.39 / 73.30 | 82.90 / 58.63 |
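The two tables can be cross-checked against each other. The sketch below recomputes the averaged columns from the per-domain numbers, assuming that the "(Avg)" columns are unweighted means over the four domains and that "Overall Avg" is the mean of RID@1, R@1, and R@5; these assumptions are inferred from the reported figures, not stated explicitly in this card. R@5 is only reported in aggregate, so it is copied from the summary table.

```python
# Per-domain (RID@1, R@1) pairs copied from the breakdown table.
PER_DOMAIN = {
    "Scalar β": {
        "Fashion": (73.68, 64.45),
        "Car": (78.39, 54.85),
        "Product": (91.36, 73.85),
        "Landmark": (82.65, 59.18),
    },
    "Vector β": {
        "Fashion": (75.71, 65.97),
        "Car": (77.97, 54.35),
        "Product": (91.39, 73.30),
        "Landmark": (82.90, 58.63),
    },
}

# R@5 (Avg) values copied from the summary table.
R5_AVG = {"Scalar β": 90.98, "Vector β": 91.35}


def summarize(variant):
    """Recompute the averaged metrics for one variant (assumed unweighted means)."""
    pairs = list(PER_DOMAIN[variant].values())
    rid1_avg = sum(rid for rid, _ in pairs) / len(pairs)
    r1_avg = sum(r1 for _, r1 in pairs) / len(pairs)
    overall = (rid1_avg + r1_avg + R5_AVG[variant]) / 3
    return {
        "RID@1": round(rid1_avg, 2),
        "R@1": round(r1_avg, 2),
        "Overall": round(overall, 2),
    }


for name in PER_DOMAIN:
    print(name, summarize(name))
```

Under these assumptions the recomputed values agree with the summary table to two decimal places.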
✒️ Citation
If you find our dataset, models, or codebase useful in your research, please consider citing our paper:
@article{yang2026beyond,
title={Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval},
author={Yang, Yuxin and Zhou, Yinan and Chen, Yuxin and Zhang, Ziqi and Ma, Zongyang and Yuan, Chunfeng and Li, Bing and Gao, Jun and Hu, Weiming},
journal={arXiv preprint arXiv:2604.05393},
year={2026}
}
Base model: Salesforce/blip2-itm-vit-g