See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding Paper • 2605.18018 • Published 12 days ago • 32
GeoAgent: Learning to Geolocate Everywhere with Reinforced Geographic Characteristics Paper • 2602.12617 • Published Feb 13 • 20
See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding Paper • 2605.18018 • Published 12 days ago • 32
See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding Paper • 2605.18018 • Published 12 days ago • 32
Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation Paper • 2604.25819 • Published Apr 28 • 17
GeoAgent: Learning to Geolocate Everywhere with Reinforced Geographic Characteristics Paper • 2602.12617 • Published Feb 13 • 20
A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models Paper • 2508.01548 • Published Aug 3, 2025 • 14 • 2
A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models Paper • 2508.01548 • Published Aug 3, 2025 • 14
Towards RAW Object Detection in Diverse Conditions Paper • 2411.15678 • Published Nov 24, 2024 • 1
Facial Dynamics in Video: Instruction Tuning for Improved Facial Expression Perception and Contextual Awareness Paper • 2501.07978 • Published Jan 14, 2025 • 1