MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation Paper • 2605.20183 • Published 9 days ago • 14
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks Paper • 2507.11336 • Published Jul 15, 2025 • 8
Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning Paper • 2512.19687 • Published Dec 22, 2025 • 3
Article SigLIP 2: A better multilingual vision language encoder • ariG23498, merve, qubvel-hf (+1) • Feb 21, 2025 • 214
ShowTable: Unlocking Creative Table Visualization with Collaborative Reflection and Refinement Paper • 2512.13303 • Published Dec 15, 2025 • 17
Emu3.5 Collection Native Multimodal Models are World Learners • 4 items • Updated Feb 4 • 77
Routing Matters in MoE: Scaling Diffusion Transformers with Explicit Routing Guidance Paper • 2510.24711 • Published Oct 28, 2025 • 20
SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer Paper • 2509.24695 • Published Sep 29, 2025 • 53
Diffusion Transformers with Representation Autoencoders Paper • 2510.11690 • Published Oct 13, 2025 • 171
Running • Video Generation Leaderboard • 209 • Text to Video and Image to Video Arena & Leaderboard
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization Paper • 2411.10442 • Published Nov 15, 2024 • 87