HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding Paper • 2601.14724 • Published 16 days ago • 74
SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning Paper • 2512.24330 • Published Dec 30, 2025 • 35
InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue Paper • 2510.13747 • Published Oct 15, 2025 • 30
CVD-STORM: Cross-View Video Diffusion with Spatial-Temporal Reconstruction Model for Autonomous Driving Paper • 2510.07944 • Published Oct 9, 2025 • 25
ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding Paper • 2508.21496 • Published Aug 29, 2025 • 55
VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models Paper • 2504.15279 • Published Apr 21, 2025 • 78
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models Paper • 2504.10479 • Published Apr 14, 2025 • 306
Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy Paper • 2503.19757 • Published Mar 25, 2025 • 51
MaskGWM: A Generalizable Driving World Model with Video Mask Reconstruction Paper • 2502.11663 • Published Feb 17, 2025 • 40