Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning

This is the official model released for the paper Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning.

Abstract

Any entity in the visual world can be hierarchically grouped based on shared characteristics and mapped to fine-grained sub-categories. While Multi-modal Large Language Models (MLLMs) achieve strong performance on coarse-grained visual tasks, they often struggle with Fine-Grained Visual Recognition (FGVR). Adapting general-purpose MLLMs to FGVR typically requires large amounts of annotated data, which is costly to obtain, leaving a substantial performance gap compared to contrastive CLIP models dedicated for discriminative tasks. Moreover, MLLMs tend to overfit to seen sub-categories and generalize poorly to unseen ones. To address these challenges, we propose Fine-R1, an MLLM tailored for FGVR through an R1-style training framework: (1) Chain-of-Thought Supervised Fine-tuning, where we construct a high-quality FGVR CoT dataset with rationales of "visual analysis, candidate sub-categories, comparison, and prediction", transition the model into a strong open-world classifier; and (2) Triplet Augmented Policy Optimization, where Intra-class Augmentation mixes trajectories from anchor and positive images within the same category to improve robustness to intra-class variance, while Inter-class Augmentation maximizes the response distinction conditioned on images across sub-categories to enhance discriminative ability. With only 4-shot training, Fine-R1 outperforms existing general MLLMs, reasoning MLLMs, and even contrastive CLIP models in identifying both seen and unseen sub-categories, showing promise in working in knowledge-intensive domains where gathering expert annotations for all sub-categories is arduous.

GitHub Repository

https://github.com/PKU-ICST-MIPL/FineR1_ICLR2026

Model Version

Fine-R1-3B

Usage

This model can be used with the Hugging Face transformers library. For detailed usage examples and how to integrate it into your projects, please refer to the official GitHub Repository.

Downloads last month: 13

Safetensors

Model size

4B params

Tensor type

BF16

Model tree for StevenHH2000/Fine-R1-3B

Quantizations

2 models

Collection including StevenHH2000/Fine-R1-3B

Fine-R1

Collection

Welcome to Fine-R1 👋, which is the first MLLM to surpass various strong CLIP-like models in fine-grained visual recognition. • 2 items • Updated 3 days ago • 1