Trials, Errors, and Breakthroughs: Our Rocky Road to OVD SOTA with Reinforcement Learning (Mar 25, 2025)
VLM-FO1-Models: a collection of VLM-FO1 models
- omlab/VLM-FO1_Qwen2.5-VL-3B-v01 (Object Detection, 4B, updated Nov 28, 2025)
VLM-R1-models: a collection of VLM-R1 models
- omlab/Qwen2.5VL-3B-VLM-R1-REC-500steps (Zero-Shot Object Detection, 4B, updated Apr 14, 2025)
- omlab/VLM-R1-Qwen2.5VL-3B-Math-0305 (Visual Question Answering, 4B, updated Apr 14, 2025)
- omlab/VLM-R1-Qwen2.5VL-3B-OVD-0321 (Image-Text-to-Text, 4B, updated Jul 18, 2025)
Remote Sensing Referring Expression Understanding: datasets for the referring expression understanding (REU) task in remote sensing
- omlab/VRSBench-FS (updated Oct 2, 2025)
- omlab/NWPU-FS (updated Oct 2, 2025)
- omlab/EarthReason-FS (updated Oct 2, 2025)
- omlab/Cross_DIOR-RSVG (updated Oct 2, 2025)
Multimodal Research: papers
- ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration (arXiv:2411.16044, published Nov 25, 2024)
- OmChat: A Recipe to Train Multimodal Language Models with Strong Long Context and Video Understanding (arXiv:2407.04923, published Jul 6, 2024)
- OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network (arXiv:2209.05946, published Sep 10, 2022)
- VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations (arXiv:2207.00221, published Jul 1, 2022)