Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training
Abstract
A data-efficient training framework for unified multimodal models pre-trains on image-only data and then fine-tunes on a mixture of data types, achieving state-of-the-art performance with reduced computational requirements.
Unified Multimodal Models (UMMs) are often constrained by the pre-training of their visual generation components, which typically relies on inefficient paradigms and scarce, high-quality text-image paired data. In this paper, we systematically analyze pre-training recipes for UMM visual generation and identify these two issues as the major bottlenecks. To address them, we propose Image-Only Training for UMMs (IOMM), a data-efficient two-stage training framework. The first stage pre-trains the visual generative component exclusively on abundant unlabeled image-only data, removing the dependency on paired data for this costly phase. The second stage fine-tunes the model on a mixture of unlabeled images and a small curated set of text-image pairs, improving instruction alignment and generative quality. Extensive experiments show that IOMM not only improves training efficiency but also achieves state-of-the-art (SOTA) performance. For example, our IOMM-B (3.6B) model was trained from scratch using only ~1,050 H800 GPU hours (with the vast majority, 1,000 hours, dedicated to the efficient image-only pre-training stage). It achieves 0.89 on GenEval and 0.55 on WISE, surpassing strong baselines such as BAGEL-7B (0.82 and 0.55) and BLIP3-o-4B (0.84 and 0.50). Code is available at https://github.com/LINs-lab/IOMM.
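As a rough illustration of the two-stage recipe described in the abstract, the sketch below shows how such a schedule might look in PyTorch: a masked-reconstruction stage on unlabeled images, followed by a mixed stage that adds a small set of text-image pairs. All names here (`ToyVisualGenerator`, `stage1_image_only`, `stage2_mixed`, the toy text-conditioning path, and the 0.6 mask ratio) are hypothetical placeholders, not the paper's implementation; the actual IOMM architecture, losses, and data pipeline are defined in the paper and repository.

```python
# Minimal sketch of a two-stage schedule (image-only pre-training, then mixed
# fine-tuning). Toy model and losses for illustration only; not IOMM itself.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyVisualGenerator(nn.Module):
    """Stand-in for a UMM's visual generative component (hypothetical)."""

    def __init__(self, dim=64, patch=8, img=32):
        super().__init__()
        self.patch, self.n = patch, (img // patch) ** 2
        self.embed = nn.Linear(3 * patch * patch, dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.blocks = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, 4, batch_first=True), num_layers=2
        )
        self.head = nn.Linear(dim, 3 * patch * patch)
        self.text_proj = nn.Linear(16, dim)  # toy text-conditioning path

    def patchify(self, x):
        # (B, 3, H, W) -> (B, num_patches, 3 * patch * patch)
        b, p = x.shape[0], self.patch
        x = x.unfold(2, p, p).unfold(3, p, p)
        return x.permute(0, 2, 3, 1, 4, 5).reshape(b, self.n, -1)

    def forward(self, images, text_emb=None, mask_ratio=0.6):
        tokens = self.embed(self.patchify(images))
        # Randomly mask a fraction of patch tokens and reconstruct them.
        mask = torch.rand(tokens.shape[:2], device=tokens.device) < mask_ratio
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token, tokens)
        if text_emb is not None:  # optional text conditioning (stage 2 pairs)
            tokens = tokens + self.text_proj(text_emb).unsqueeze(1)
        pred = self.head(self.blocks(tokens))
        target = self.patchify(images)
        return F.mse_loss(pred[mask], target[mask])  # loss on masked patches


def stage1_image_only(model, image_loader, steps, opt):
    """Stage 1: pre-train on unlabeled images only (no text anywhere)."""
    for _, images in zip(range(steps), image_loader):
        loss = model(images)
        opt.zero_grad()
        loss.backward()
        opt.step()


def stage2_mixed(model, image_loader, pair_loader, steps, opt):
    """Stage 2: fine-tune on unlabeled images mixed with text-image pairs."""
    for _, (images, (pair_imgs, text_emb)) in zip(
        range(steps), zip(image_loader, pair_loader)
    ):
        loss = model(images) + model(pair_imgs, text_emb=text_emb)
        opt.zero_grad()
        loss.backward()
        opt.step()
```

The point of the split is visible in the loop structure: the expensive first stage never touches text, so it can consume arbitrarily large unlabeled image corpora, while the short second stage only needs enough paired data to align generation with instructions.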
Community
IOMM (Image-Only Training for UMMs) introduces a data-efficient two-stage framework that achieves state-of-the-art multimodal generation by replacing the costly reliance on paired text-image data with a high-performance "image-only" pre-training stage.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing (2026)
- Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device (2026)
- VisionPangu: A Compact and Fine-Grained Multimodal Assistant with 1.7B Parameters (2026)
- Edit2Interp: Adapting Image Foundation Models from Spatial Editing to Video Frame Interpolation with Few-Shot Learning (2026)
- CARPE: Context-Aware Image Representation Prioritization via Ensemble for Large Vision-Language Models (2026)
- Diff-Aid: Inference-time Adaptive Interaction Denoising for Rectified Text-to-Image Generation (2026)
- Omni-Video 2: Scaling MLLM-Conditioned Diffusion for Unified Video Generation and Editing (2026)