Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation
Abstract
Researchers extend MeanFlow generation from class labels to text inputs by integrating powerful LLM-based text encoders, overcoming limitations of few-step refinement through enhanced semantic feature representation.
Few-step generation has been a long-standing goal, with recent one-step methods exemplified by MeanFlow achieving remarkable results. Existing research on MeanFlow primarily focuses on class-to-image generation. However, an intuitive yet unexplored direction is to extend the condition from fixed class labels to flexible text inputs, enabling richer content creation. Compared to limited class labels, text conditions pose greater challenges to the model's understanding capability, necessitating the effective integration of powerful text encoders into the MeanFlow framework. Surprisingly, although incorporating text conditions appears straightforward, we find that integrating powerful LLM-based text encoders with conventional training strategies yields unsatisfactory performance. To uncover the underlying cause, we conduct detailed analyses and reveal that, because MeanFlow uses extremely few refinement steps (often only one), the text feature representations must possess sufficiently high discriminability. This also explains why discrete, easily distinguishable class features perform well within the MeanFlow framework. Guided by these insights, we leverage a powerful LLM-based text encoder validated to possess the required semantic properties and adapt the MeanFlow generation process to this framework, achieving efficient text-conditioned synthesis for the first time. Furthermore, we validate our approach on a widely used diffusion model, demonstrating significant generation performance improvements. We hope this work provides a general and practical reference for future research on text-conditioned MeanFlow generation. The code is available at https://github.com/AMAP-ML/EMF.
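The abstract's central claim, that one-step generation demands highly discriminative text features, can be made concrete with a simple probe. The sketch below is my own illustration, not the paper's metric, and the embeddings are synthetic: it scores an encoder by the gap between intra-concept and inter-concept cosine similarity.

```python
import numpy as np

def discriminability_gap(embs, labels):
    """Mean intra-class minus mean inter-class cosine similarity.

    A large gap means embeddings of the same concept cluster tightly
    while different concepts stay well separated -- the property the
    paper argues text features need when MeanFlow has only one
    refinement step to act on them. (Illustrative metric, not the
    paper's exact analysis.)
    """
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sim = embs @ embs.T                       # pairwise cosine similarity
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    intra = sim[same & off_diag].mean()       # same concept, different sample
    inter = sim[~same].mean()                 # different concepts
    return float(intra - inter)

# Synthetic demo: 3 well-separated "concepts", 4 noisy embeddings each.
rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 64))
embs = np.repeat(centers, 4, axis=0) + 0.05 * rng.normal(size=(12, 64))
labels = np.repeat(np.arange(3), 4)
gap = discriminability_gap(embs, labels)      # clearly positive for separated clusters
```

An encoder whose real caption embeddings score a near-zero gap would, by the paper's argument, be a poor fit for one-step MeanFlow generation.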
Community
An interesting work (CVPR26) extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Towards Training-Free Scene Text Editing (2026)
- Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training (2026)
- Reflective Flow Sampling Enhancement (2026)
- EVLF: Early Vision-Language Fusion for Generative Dataset Distillation (2026)
- RefAlign: Representation Alignment for Reference-to-Video Generation (2026)
- RPiAE: A Representation-Pivoted Autoencoder Enhancing Both Image Generation and Editing (2026)
- Language-Free Generative Editing from One Visual Example (2026)
The kicker here is that true one-step text-to-image needs text representations with strong semantic discriminability and clean disentanglement, not just fancy language modeling. Their recipe, i.e. two temporal embeddings (interval length and end time), adaptive (t, r) sampling, and stop-gradient targets, reads as a minimal set of changes that actually lines up the velocity field with a single refinement step. BLIP3o-NEXT seems to deliver on that discriminability, while SANA-1.5 falls short, which matches the intuition that object-level semantics matter a lot when you have almost no denoising budget. I'd love to see an ablation that dials the encoder's discriminability down and quantifies at what point four-step generation collapses, or a simple predictor of encoder suitability. BTW, arxivlens has a solid walkthrough that helped me parse the method details, especially the two-embedding and stop-gradient parts: https://arxivlens.com/PaperView/Details/extending-one-step-image-generation-from-class-labels-to-text-via-discriminative-text-representation-541-0151456f
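For readers who want that recipe in concrete form, here is a minimal PyTorch sketch of a MeanFlow-style training step with the ingredients the comment lists: two temporal conditions (interval length t - r and end time t) and a stop-gradient target built from the MeanFlow identity u(z_t, r, t) = v_t - (t - r) d/dt u. The architecture and names are my own illustration under those assumptions, not the paper's code, and the simple r <= t sampling below stands in for the adaptive (t, r) scheme.

```python
import torch
import torch.nn as nn

class TinyMeanFlowNet(nn.Module):
    """Toy average-velocity network; real models are far larger."""
    def __init__(self, dim=8, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 2, hidden), nn.SiLU(), nn.Linear(hidden, dim)
        )

    def forward(self, z, r, t):
        # The two temporal conditions: interval length (t - r) and end time t.
        cond = torch.stack([t - r, t], dim=-1)
        return self.net(torch.cat([z, cond], dim=-1))

def meanflow_loss(model, x0, x1, r, t):
    """Regress u_theta(z_t, r, t) onto the stop-gradient MeanFlow target
    v_t - (t - r) * d/dt u, with d/dt u computed as a jacobian-vector
    product along the interpolation path."""
    z_t = (1 - t[:, None]) * x0 + t[:, None] * x1   # linear path from noise to data
    v_t = x1 - x0                                    # instantaneous velocity on that path
    # Total derivative d/dt u via forward-mode JVP: dz/dt = v_t, dt/dt = 1.
    u, du_dt = torch.func.jvp(
        lambda z, tt: model(z, r, tt),
        (z_t, t),
        (v_t, torch.ones_like(t)),
    )
    target = (v_t - (t - r)[:, None] * du_dt).detach()   # stop-gradient target
    return ((u - target) ** 2).mean()

# One toy training step.
model = TinyMeanFlowNet()
x0, x1 = torch.randn(4, 8), torch.randn(4, 8)   # noise and data samples
t = torch.rand(4)
r = torch.rand(4) * t                            # crude sampling with r <= t
loss = meanflow_loss(model, x0, x1, r, t)
loss.backward()
```

At inference, one step means sampling z0 from noise and emitting z0 + u_theta(z0, 0, 1), which is exactly why the target field has so little room for ambiguous conditioning.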