-
EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
Paper • 2402.04252 • Published • 29 -
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
Paper • 2402.03749 • Published • 14 -
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Paper • 2402.04615 • Published • 44 -
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
Paper • 2402.05008 • Published • 23
Collections
Discover the best community collections!
Collections including paper arxiv:2511.21631
-
DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI
Paper • 2512.16676 • Published • 202 -
Qwen3-VL Technical Report
Paper • 2511.21631 • Published • 148 -
Step-DeepResearch Technical Report
Paper • 2512.20491 • Published • 80 -
Deep Research: A Systematic Survey
Paper • 2512.02038 • Published • 66
-
Grounding Computer Use Agents on Human Demonstrations
Paper • 2511.07332 • Published • 105 -
Qwen3-VL Technical Report
Paper • 2511.21631 • Published • 148 -
Step-GUI Technical Report
Paper • 2512.15431 • Published • 128 -
MAI-UI Technical Report: Real-World Centric Foundation GUI Agents
Paper • 2512.22047 • Published • 26
-
ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning
Paper • 2512.02835 • Published • 9 -
Joint 3D Geometry Reconstruction and Motion Generation for 4D Synthesis from a Single Image
Paper • 2512.05044 • Published • 16 -
Entropy Ratio Clipping as a Soft Global Constraint for Stable Reinforcement Learning
Paper • 2512.05591 • Published • 16 -
SpaceControl: Introducing Test-Time Spatial Control to 3D Generative Modeling
Paper • 2512.05343 • Published • 24
-
PubTables-1M: Towards comprehensive table extraction from unstructured documents
Paper • 2110.00061 • Published • 3 -
Optimized Table Tokenization for Table Structure Recognition
Paper • 2305.03393 • Published • 1 -
Qwen3-VL Technical Report
Paper • 2511.21631 • Published • 148 -
PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
Paper • 2510.14528 • Published • 111
-
EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
Paper • 2402.04252 • Published • 29 -
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
Paper • 2402.03749 • Published • 14 -
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Paper • 2402.04615 • Published • 44 -
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
Paper • 2402.05008 • Published • 23
-
DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI
Paper • 2512.16676 • Published • 202 -
Qwen3-VL Technical Report
Paper • 2511.21631 • Published • 148 -
Step-DeepResearch Technical Report
Paper • 2512.20491 • Published • 80 -
Deep Research: A Systematic Survey
Paper • 2512.02038 • Published • 66
-
Grounding Computer Use Agents on Human Demonstrations
Paper • 2511.07332 • Published • 105 -
Qwen3-VL Technical Report
Paper • 2511.21631 • Published • 148 -
Step-GUI Technical Report
Paper • 2512.15431 • Published • 128 -
MAI-UI Technical Report: Real-World Centric Foundation GUI Agents
Paper • 2512.22047 • Published • 26
-
ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning
Paper • 2512.02835 • Published • 9 -
Joint 3D Geometry Reconstruction and Motion Generation for 4D Synthesis from a Single Image
Paper • 2512.05044 • Published • 16 -
Entropy Ratio Clipping as a Soft Global Constraint for Stable Reinforcement Learning
Paper • 2512.05591 • Published • 16 -
SpaceControl: Introducing Test-Time Spatial Control to 3D Generative Modeling
Paper • 2512.05343 • Published • 24
-
PubTables-1M: Towards comprehensive table extraction from unstructured documents
Paper • 2110.00061 • Published • 3 -
Optimized Table Tokenization for Table Structure Recognition
Paper • 2305.03393 • Published • 1 -
Qwen3-VL Technical Report
Paper • 2511.21631 • Published • 148 -
PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
Paper • 2510.14528 • Published • 111