Collections including paper arxiv:2511.21631

EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters

Paper • 2402.04252 • Published Feb 6, 2024 • 29
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models

Paper • 2402.03749 • Published Feb 6, 2024 • 14
ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Paper • 2402.04615 • Published Feb 7, 2024 • 44
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss

Paper • 2402.05008 • Published Feb 7, 2024 • 23

Foundation models

Qwen3-VL Technical Report

Paper • 2511.21631 • Published Nov 26, 2025 • 148

Comic Panel Description

Qwen3-VL Technical Report

Paper • 2511.21631 • Published Nov 26, 2025 • 148
Qwen/Qwen3-VL-8B-Thinking-FP8

Image-Text-to-Text • 9B • Updated Nov 26, 2025 • 36.3k • 30

Qwen3-VL Technical Report

Paper • 2511.21631 • Published Nov 26, 2025 • 148
Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch

Paper • 2512.02395 • Published Dec 2, 2025 • 47

MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation

Paper • 2511.09611 • Published Nov 12, 2025 • 69
Qwen3-VL Technical Report

Paper • 2511.21631 • Published Nov 26, 2025 • 148

DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI

Paper • 2512.16676 • Published 19 days ago • 202
Qwen3-VL Technical Report

Paper • 2511.21631 • Published Nov 26, 2025 • 148
Step-DeepResearch Technical Report

Paper • 2512.20491 • Published 14 days ago • 80
Deep Research: A Systematic Survey

Paper • 2512.02038 • Published Nov 24, 2025 • 66

GUI automation - Computer use

Grounding Computer Use Agents on Human Demonstrations

Paper • 2511.07332 • Published Nov 10, 2025 • 105
Qwen3-VL Technical Report

Paper • 2511.21631 • Published Nov 26, 2025 • 148
Step-GUI Technical Report

Paper • 2512.15431 • Published 20 days ago • 128
MAI-UI Technical Report: Real-World Centric Foundation GUI Agents

Paper • 2512.22047 • Published 11 days ago • 26

ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning

Paper • 2512.02835 • Published Dec 2, 2025 • 9
Joint 3D Geometry Reconstruction and Motion Generation for 4D Synthesis from a Single Image

Paper • 2512.05044 • Published Dec 4, 2025 • 16
Entropy Ratio Clipping as a Soft Global Constraint for Stable Reinforcement Learning

Paper • 2512.05591 • Published Dec 5, 2025 • 16
SpaceControl: Introducing Test-Time Spatial Control to 3D Generative Modeling

Paper • 2512.05343 • Published Dec 5, 2025 • 24

Salesforce/blip2-opt-2.7b

Image-Text-to-Text • 4B • Updated Feb 3, 2025 • 491k • 427
timbrooks/instruct-pix2pix

Image-to-Image • Updated Jul 5, 2023 • 95.2k • 1.16k
huggan/pix2pix-edge2shoes

Updated Apr 15, 2022 • 3
KaiChen1998/geodiffusion-nuimages-time-weather-512x512

Text-to-Image • Updated Dec 5, 2024 • 5

PubTables-1M: Towards comprehensive table extraction from unstructured documents

Paper • 2110.00061 • Published Sep 30, 2021 • 3
Optimized Table Tokenization for Table Structure Recognition

Paper • 2305.03393 • Published May 5, 2023 • 1
Qwen3-VL Technical Report

Paper • 2511.21631 • Published Nov 26, 2025 • 148
PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model

Paper • 2510.14528 • Published Oct 16, 2025 • 111

