MIRO (main)
Qualitative samples from the released MIRO checkpoint — same gallery as the teaser of the project page.
Main MIRO checkpoint. Trained jointly on all seven reward signals (CLIP, aesthetic, ImageReward, PickScore, HPSv2, VQAScore, SciScore) with a 50/50 mix of original and synthetic captions.
This checkpoint accompanies the paper MIRO: MultI-Reward cOnditioned pretraining improves T2I quality and efficiency (Dufour, Degeorge, Ghosh, Kalogeiton, Picard — ICML 2026).
| Field | Value |
|---|---|
| Paper | https://arxiv.org/abs/2510.25897 |
| Project page | https://nicolas-dufour.github.io/miro/ |
| Code | https://github.com/nicolas-dufour/miro |
| Parameters | 360.4M |
| Resolution | 256×256 (SDXL VAE latent space) |
| Architecture | RIN flow-matching backbone, FLAN-T5-XL text conditioning |
| Training data | CC12M + LAION Aesthetics v2 4.5 (6.0+ aesthetic subset) |
| Reward signals | clip_score, aesthetic_score, image_reward_score, pick_a_score_score, hpsv2_score, vqa_score, sciscore_score |
| Weights | model.safetensors, fp32 (EMA master weights — ready for finetuning) |
Install
pip install miro-t2i
miro-t2i is the public PyPI package; it is imported as miro (import miro). The first call to MiroPipeline.from_pretrained(...) additionally fetches google/flan-t5-xl (text encoder) and stabilityai/sdxl-vae (latent decoder) from the Hub.
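On machines without internet access at run time, it may help to download these two auxiliary repositories ahead of time. The sketch below uses snapshot_download from huggingface_hub. The helper name prefetch_aux_models is hypothetical and not part of miro-t2i, and the sketch assumes that MiroPipeline resolves both models through the standard Hugging Face cache, which this card does not confirm.

```python
from typing import Callable, Dict, Iterable, Optional

# Repositories this card names as fetched on the first from_pretrained call.
AUX_REPOS = ("google/flan-t5-xl", "stabilityai/sdxl-vae")


def prefetch_aux_models(
    repos: Iterable[str] = AUX_REPOS,
    download: Optional[Callable[[str], str]] = None,
) -> Dict[str, str]:
    """Download each repo into the local Hub cache and return its local path.

    Hypothetical helper for illustration; not part of miro-t2i.
    """
    if download is None:
        # Imported lazily so the dependency is only needed when downloading.
        from huggingface_hub import snapshot_download

        download = snapshot_download
    return {repo: download(repo) for repo in repos}


# Example (downloads several GB on first use, so it is left commented out):
# paths = prefetch_aux_models()
```

The download argument exists only so the helper can be exercised with a stand-in function; in normal use it is left at its default.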
Usage
import torch
from miro import MiroPipeline
pipe = MiroPipeline.from_pretrained("nicolas-dufour/miro")
pipe = pipe.to("cuda", torch.float16)
prompt = (
    "Photography closeup portrait of an adorable rusty brokendown steampunk "
    "robot covered in budding vegetation, surrounded by tall grass, misty "
    "futuristic scifi forest environment."
)
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.0)[0]
image.save("out.png")
Reward conditioning
MIRO conditions the flow model on a vector of reward targets in addition to the
text prompt. By default every reward is requested at its maximum (1.0); you
can override individual axes to bias generation toward a particular trade-off:
image = pipe(
    prompt,  # the rusty-robot prompt from above
    reward_targets={
        "clip_score": 1.0,          # strict prompt alignment
        "aesthetic_score": 0.3,     # de-prioritise prettiness
        "image_reward_score": 1.0,  # prioritise general human preference
        # any reward not listed defaults to 1.0
    },
    negative_reward_targets={
        # zeros by default; what to push the unconditional branch toward
    },
    guidance_scale=7.0,
)[0]
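How the two target dictionaries interact with guidance_scale can be illustrated with the usual classifier-free guidance combination. This is a sketch, assuming that the pipeline extrapolates from a prediction conditioned on negative_reward_targets toward one conditioned on reward_targets. The function name guided_velocity and the exact formula are illustrative and not taken from the miro code.

```python
from typing import List, Sequence


def guided_velocity(
    v_positive: Sequence[float],
    v_negative: Sequence[float],
    guidance_scale: float,
) -> List[float]:
    """Standard classifier-free guidance: v_neg + s * (v_pos - v_neg)."""
    if len(v_positive) != len(v_negative):
        raise ValueError("predictions must have the same length")
    return [
        n + guidance_scale * (p - n)
        for p, n in zip(v_positive, v_negative)
    ]


# Toy 3-element predictions standing in for real model outputs.
v_pos = [1.0, 0.5, -0.25]  # conditioned on reward_targets
v_neg = [0.0, 0.5, 0.25]   # conditioned on negative_reward_targets

print(guided_velocity(v_pos, v_neg, 1.0))  # [1.0, 0.5, -0.25]
print(guided_velocity(v_pos, v_neg, 7.0))  # [7.0, 0.5, -3.25]
```

With a scale of 1.0 the result is the positive prediction. Larger scales move the output further from the negative branch, along the direction in which the two predictions differ.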
The seven reward dimensions are:
| Reward | Normalised range | What it measures |
|---|---|---|
| clip_score | ~[0, 1] | CLIP text–image alignment |
| aesthetic_score | ~[0, 1] | LAION aesthetic-quality predictor |
| image_reward_score | ~[0, 1] | ImageReward (general preference model) |
| pick_a_score_score | ~[0, 1] | PickScore (human preference) |
| hpsv2_score | ~[0, 1] | HPSv2 (human preference v2) |
| vqa_score | ~[0, 1] | VQAScore (compositional faithfulness) |
| sciscore_score | ~[0, 1] | SciScore (scientific-image plausibility) |
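Because the reward keys are long and easy to mistype, a small helper can build the full target dictionary from a few overrides. The sketch below is hypothetical (make_reward_targets is not part of miro-t2i). It fills unlisted rewards with 1.0, as described above. It also rejects values outside [0, 1], which is an assumption: the card gives the range only as approximate.

```python
from typing import Dict, Mapping, Optional

# The seven reward keys listed in this card.
REWARD_KEYS = (
    "clip_score",
    "aesthetic_score",
    "image_reward_score",
    "pick_a_score_score",
    "hpsv2_score",
    "vqa_score",
    "sciscore_score",
)


def make_reward_targets(
    overrides: Optional[Mapping[str, float]] = None,
    default: float = 1.0,
) -> Dict[str, float]:
    """Return a full seven-key dict, rejecting typos and out-of-range values."""
    overrides = dict(overrides or {})
    unknown = set(overrides) - set(REWARD_KEYS)
    if unknown:
        raise KeyError(f"unknown reward keys: {sorted(unknown)}")
    targets = {key: float(overrides.get(key, default)) for key in REWARD_KEYS}
    for key, value in targets.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{key}={value} is outside [0, 1]")
    return targets


targets = make_reward_targets({"aesthetic_score": 0.3})
print(targets["aesthetic_score"], targets["clip_score"])  # 0.3 1.0
```

The resulting dictionary can be passed as reward_targets in the call shown earlier.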
Reported benchmarks
The paper reports the following headline numbers for the main MIRO model
(this repo's nicolas-dufour/miro):
| Metric | MIRO (350M) | FLUX-dev (12B) |
|---|---|---|
| GenEval (overall) | 75 (with inference-time reward tuning) / 68 (default) | 67 |
| Inference compute | 1× | ~370× |
| Aesthetic-metric convergence vs. baseline pretraining | 19× faster | — |
Per-variant scores (GenEval, FID, individual reward scores) for the eight ablations are reported in the paper's ablation tables. Please refer to arXiv:2510.25897 for the full breakdown.
Training compute and data
- Default hardware: 2 nodes × 8 H100 GPUs (16× H100, 16-mixed precision)
- Optimiser: LAMB, lr 1e-3 (5k warmup → cosine decay), weight decay 1e-2
- Batch size: 1024 globally (64 per GPU on 16× H100), gradient-clip 2.0
- Steps: 500k (≈ 29 epochs over the enriched training set)
- Wall-clock on 16× H100: ~52 hours (≈ 2.65 train it/s sustained)
- 8-GPU fallback: 1 node × 8 H100 with trainer.accumulate_grad_batches=2, measured at ≈ 1.45 train it/s → 96 hours (4 days) end-to-end. Requires trainer.strategy.static_graph=false and trainer.strategy.find_unused_parameters=true to play well with the self-conditioning skip in the loss; both flags are set automatically by miro/slurm/launch_multicad_synth_8gpu.py.
- Data: CC12M + LAION Aesthetics v2 4.5 filtered to aesthetic_score >= 6.0 (the higher-quality subset), encoded to SDXL VAE latents at 256 resolution. Each sample is paired with seven reward scores and FLAN-T5-XL embeddings of both the original and a synthetic caption, computed by miro/data/preprocess_data.py.
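The figures in the list above can be cross-checked with a few lines of arithmetic. The sketch rests on three assumptions that the card does not state:

- The quoted it/s rates count optimiser steps, which is consistent with the quoted hours.
- The 8-GPU fallback keeps 64 samples per GPU.
- The schedule uses a linear warmup followed by cosine decay to zero.

All function names are illustrative.

```python
import math

TOTAL_STEPS = 500_000
WARMUP_STEPS = 5_000
PEAK_LR = 1e-3


def global_batch(per_gpu: int, gpus: int, accumulate: int = 1) -> int:
    """Effective number of samples per optimiser step."""
    return per_gpu * gpus * accumulate


def wall_clock_hours(steps: int, its_per_second: float) -> float:
    """Hours needed to run `steps` iterations at a sustained rate."""
    return steps / its_per_second / 3600.0


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero (assumed shape)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * min(progress, 1.0)))


print(global_batch(64, 16))                           # 1024
print(global_batch(64, 8, accumulate=2))              # 1024
print(round(wall_clock_hours(TOTAL_STEPS, 2.65), 1))  # 52.4
print(round(wall_clock_hours(TOTAL_STEPS, 1.45), 1))  # 95.8
print(lr_at(WARMUP_STEPS))                            # 0.001
```

Both hardware configurations reach the same global batch of 1024, so under these assumptions the 8-GPU run changes wall-clock time but not the optimisation recipe.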
Limitations and intended use
This checkpoint is a research artifact released to reproduce and build on the MIRO paper. Known limitations:
- Resolution: 256×256 only. Higher-resolution outputs require upscaling.
- Domain: trained on web-scraped image–caption pairs (CC12M + LAION Aesthetics 6.0). Inherits the biases of those datasets — including under-representation of many cultures, languages, and concepts, and the presence of stereotypes. Generations may reflect or amplify these biases.
- Reward-model biases: the seven reward predictors used during training encode their own biases (e.g. aesthetic and human-preference models reflect the taste of their annotator pools). Conditioning on these rewards inherits and can sharpen those biases.
- Not for safety-critical use: outputs are not factual and the SciScore reward does not guarantee scientific accuracy.
- No safety filter is shipped with the model; users deploying it in user-facing settings should add their own.
The model is released under the MIT license; the SDXL VAE and FLAN-T5-XL encoder it depends on at inference time are loaded from stabilityai/sdxl-vae and google/flan-t5-xl and are subject to their respective licenses.
Citation
@inproceedings{dufour2026miro,
title = {{MIRO}: {M}ult{I}-{R}eward c{O}nditioned pretraining improves {T2I} quality and efficiency},
author = {Dufour, Nicolas and Degeorge, Lucas and Ghosh, Arijit and Kalogeiton, Vicky and Picard, David},
booktitle = {International Conference on Machine Learning (ICML)},
year = {2026}
}
License
MIT — see https://github.com/nicolas-dufour/miro/blob/main/LICENSE.
