Training Strategies for Z-Image-Turbo
Introduction
Tongyi-MAI/Z-Image-Turbo has made it to the top of the trending model lists of popular open-model communities, including both Hugging Face and ModelScope. One of the features most praised by the community is that, being a distilled model, it can generate high-quality images in just a few steps. However, this also means the model can be tricky to train, especially if we want the resulting LoRA to retain the "turbo" capability for fast image generation.
In this blog, we describe our efforts to enable proper LoRA training of Z-Image-Turbo. In particular, we compare different training options and propose an augmented training solution that lets us rely on the plug-and-play, standardized SFT procedure without sacrificing Turbo acceleration at inference time. To achieve this, we leverage a Z-Image-Turbo-DistillPatch LoRA, which we are making openly available.
Like many of you, we are waiting for the release of Z-Image-Base, which would allow more straightforward training of Z-Image LoRAs. In the meantime, we believe the augmented training approach proposed here offers a viable way to train Z-Image-Turbo. We are also working on launching it on the training service at ModelScope Civision. Stay tuned and happy training.
The Training Options
Tongyi-MAI/Z-Image-Turbo is an accelerated generation model based on distillation technology, with its core advantage being support for low-step inference.
Training Precautions: Directly updating the model weights (e.g., via full fine-tuning or standard LoRA training) tends to disrupt the model's pre-trained acceleration trajectory, leading to the following phenomena:
- When inferring with the default "acceleration configuration" (`num_inference_steps=8`, `cfg_scale=1`), generation quality degrades significantly.
- When inferring with the "no acceleration configuration" (`num_inference_steps=30`, `cfg_scale=2`), generation quality actually improves, indicating that the model has degenerated into a non-Turbo version.
To address this issue, DiffSynth-Studio provides four training and inference combination strategies. You can select the most suitable scheme based on your requirements for inference speed and training costs.
Common Experimental Setup:
- Dataset: the example image dataset (`data/example_image_dataset`) used in the training command below
- Training Steps: 5 epochs * 50 repeats = 250 steps
- Validation Prompt: "a dog"
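To make the setup concrete, here is a minimal sketch of how such a dataset could be laid out: a folder of images plus a `metadata.csv` pairing each file with a caption. The `image`/`prompt` column names are an assumption; verify against the `metadata.csv` shipped in DiffSynth-Studio's `data/example_image_dataset`, since that is what `--dataset_metadata_path` expects.

```python
# Minimal sketch of a dataset layout for LoRA training with DiffSynth-Studio.
# The "image"/"prompt" column names are an assumption; verify against the
# metadata.csv shipped in the library's data/example_image_dataset.
import csv
from pathlib import Path

dataset_dir = Path("data/example_image_dataset")
dataset_dir.mkdir(parents=True, exist_ok=True)

# One caption per image file already placed in dataset_dir.
samples = [
    {"image": "dog_1.jpg", "prompt": "a dog sitting on the grass"},
    {"image": "dog_2.jpg", "prompt": "a dog running on a beach"},
]

with open(dataset_dir / "metadata.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["image", "prompt"])
    writer.writeheader()
    writer.writerows(samples)
```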
Scheme 1: Standard SFT Training + No Acceleration Configuration Inference
This is the most general fine-tuning method. If you do not rely on the Turbo model's rapid inference capabilities and focus solely on post-fine-tuning generation quality, you can directly use the standard SFT script for training.
- Applicable Scenario: Insensitive to inference speed; seeking a simple training workflow.
- Training Method: Use standard SFT training.
```shell
accelerate launch examples/z_image/model_training/train.py \
  --dataset_base_path data/example_image_dataset \
  --dataset_metadata_path data/example_image_dataset/metadata.csv \
  --max_pixels 1048576 \
  --dataset_repeat 50 \
  --model_id_with_origin_paths "Tongyi-MAI/Z-Image-Turbo:transformer/*.safetensors,Tongyi-MAI/Z-Image-Turbo:text_encoder/*.safetensors,Tongyi-MAI/Z-Image-Turbo:vae/diffusion_pytorch_model.safetensors" \
  --learning_rate 1e-4 \
  --num_epochs 5 \
  --remove_prefix_in_ckpt "pipe.dit." \
  --output_path "./models/train/Z-Image-Turbo_lora" \
  --lora_base_model "dit" \
  --lora_target_modules "to_q,to_k,to_v,to_out.0,w1,w2,w3" \
  --lora_rank 32 \
  --use_gradient_checkpointing \
  --dataset_num_workers 8
```
- Inference Configuration: You must abandon the acceleration configuration; adjust to `num_inference_steps=30` and `cfg_scale=2`.
Results after each epoch (8 steps, cfg=1):
Final result (30 steps, cfg=2):
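For reference, here is a minimal sketch of what Scheme 1 inference could look like once training finishes. The `ZImagePipeline`/`ModelConfig` imports, the `load_lora` helper, and the checkpoint filename are assumptions modeled on how other DiffSynth-Studio pipelines are used; consult the `examples/z_image` scripts in the repository for the exact API.

```python
# Hypothetical sketch for Scheme 1 inference (non-accelerated configuration).
# ZImagePipeline, ModelConfig, and load_lora are assumed names modeled on
# other DiffSynth-Studio pipelines; the LoRA checkpoint path is illustrative.
import torch
from diffsynth.pipelines.z_image import ZImagePipeline, ModelConfig  # assumed import path

pipe = ZImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="Tongyi-MAI/Z-Image-Turbo", origin_file_pattern="transformer/*.safetensors"),
        ModelConfig(model_id="Tongyi-MAI/Z-Image-Turbo", origin_file_pattern="text_encoder/*.safetensors"),
        ModelConfig(model_id="Tongyi-MAI/Z-Image-Turbo", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
    ],
)
# Load the LoRA produced by the SFT script (illustrative path under --output_path).
pipe.load_lora(pipe.dit, "models/train/Z-Image-Turbo_lora/epoch-4.safetensors")

# Turbo acceleration is lost after standard SFT, so use the slow configuration.
image = pipe(prompt="a dog", num_inference_steps=30, cfg_scale=2, seed=0)
image.save("scheme1_a_dog.png")
```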
Scheme 2: Differential LoRA Training + Acceleration Configuration Inference
If you wish the fine-tuned model to retain its 8-step generation acceleration capability, Differential LoRA training is recommended. This method locks the acceleration trajectory by introducing a preset LoRA.
- Applicable Scenario: Requires maintaining 8-step rapid inference with low VRAM usage.
- Training Method: Perform Differential LoRA training by loading a preset LoRA, e.g., ostris/zimage_turbo_training_adapter.
- Inference Configuration: Maintain the acceleration configuration, i.e., `num_inference_steps=8` and `cfg_scale=1`.
Final result (8 steps, cfg=1):
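The mechanics of Differential LoRA are not spelled out above. As a rough illustration only (an assumption about the recipe, not DiffSynth-Studio's actual implementation; see the adapter's model card and the library's training scripts for the authoritative details), the idea is that the preset adapter stays applied and frozen during training, locking in the distilled acceleration trajectory, while only a new LoRA on top receives gradients and is exported:

```python
# Conceptual PyTorch sketch only: train a new LoRA while a frozen preset LoRA
# (e.g. the weights of ostris/zimage_turbo_training_adapter) stays applied, so
# the distilled behaviour is preserved. This illustrates the general idea, not
# DiffSynth-Studio's actual Differential LoRA implementation.
import torch
import torch.nn as nn


class LinearWithPresetAndNewLoRA(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # base DiT weights stay frozen

        d_in, d_out = base.in_features, base.out_features

        # Preset LoRA: in practice initialised from the downloaded adapter, then frozen.
        self.preset_down = nn.Linear(d_in, rank, bias=False)
        self.preset_up = nn.Linear(rank, d_out, bias=False)
        for p in (*self.preset_down.parameters(), *self.preset_up.parameters()):
            p.requires_grad_(False)

        # New LoRA: the only trainable parameters, and the only ones exported.
        self.new_down = nn.Linear(d_in, rank, bias=False)
        self.new_up = nn.Linear(rank, d_out, bias=False)
        nn.init.zeros_(self.new_up.weight)  # the new delta starts as a no-op

    def forward(self, x):
        return (
            self.base(x)
            + self.preset_up(self.preset_down(x))  # frozen: locks the acceleration trajectory
            + self.new_up(self.new_down(x))        # trainable: learns the new concept
        )
```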
Scheme 3: Standard SFT Training + Trajectory Imitation Distillation Training + Acceleration Configuration Inference
This is a two-stage "fine-tune first, accelerate later" training scheme, aiming to let the model learn content first and then recover speed.
- Applicable Scenario: Requires standard SFT training and the recovery of acceleration capabilities.
- Training Method: First, execute the standard SFT training from Scheme 1 (at which point acceleration capability will be lost); subsequently, perform Trajectory Imitation Distillation training based on the SFT model.
- Inference Configuration: Restore the acceleration configuration, i.e., `num_inference_steps=8` and `cfg_scale=1`.
Final result (8 steps, cfg=1):
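The distillation objective itself is not described above. As a rough mental model only (not DiffSynth-Studio's actual training loop), trajectory imitation distillation can be pictured as training the few-step student to reproduce where the slow, SFT-ed teacher ends up. The `run_denoising` sampler and the model call signature below are placeholders:

```python
# Rough conceptual sketch of trajectory imitation distillation, not
# DiffSynth-Studio's actual training loop. The SFT-ed teacher is run with the
# slow configuration; the student is trained to reach the same result in
# 8 steps. run_denoising and the model call signature are placeholders.
import torch
import torch.nn.functional as F


def run_denoising(model, latents, prompt_emb, num_steps, cfg_scale):
    # Toy Euler-style sampler standing in for the pipeline's real scheduler.
    x = latents
    for t in torch.linspace(1.0, 1.0 / num_steps, num_steps):
        x = x - model(x, t, prompt_emb, cfg_scale) / num_steps
    return x


def distillation_step(student, teacher, latents, prompt_emb, optimizer):
    with torch.no_grad():
        # Teacher trajectory: the non-accelerated configuration (many small steps, CFG on).
        target = run_denoising(teacher, latents, prompt_emb, num_steps=30, cfg_scale=2)
    # Student trajectory: the accelerated configuration it must keep supporting.
    pred = run_denoising(student, latents, prompt_emb, num_steps=8, cfg_scale=1)
    # A real recipe would likely also match intermediate states along the trajectory.
    loss = F.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```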
Scheme 4: Standard SFT Training + Loading Distillation Acceleration LoRA during Inference + Acceleration Configuration Inference
This scheme uses standard SFT for training and utilizes an external module (https://www.modelscope.cn/models/DiffSynth-Studio/Z-Image-Turbo-DistillPatch) during inference to recover acceleration capabilities.
- Applicable Scenario: Wishing to use the standard SFT workflow, or already possessing a trained SFT model and hoping to restore its acceleration characteristics without re-training.
- Training Method: Execute the standard SFT training from Scheme 1.
- Inference Method: Additionally load the Distillation Acceleration LoRA and use the acceleration configuration of `num_inference_steps=8` and `cfg_scale=1`.
Final result (8 steps, cfg=1):
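As with Scheme 1, here is a hedged sketch of what Scheme 4 inference could look like: both the SFT LoRA and the Z-Image-Turbo-DistillPatch LoRA are loaded, and the acceleration configuration is used. Class names, the `load_lora` helper, and the file paths are assumptions; check the DistillPatch model card and the repository's `examples/z_image` scripts for the exact usage.

```python
# Hypothetical sketch for Scheme 4 inference: SFT LoRA + DistillPatch LoRA,
# accelerated configuration. ZImagePipeline, ModelConfig, and load_lora are
# assumed names; both LoRA paths are illustrative.
import torch
from diffsynth.pipelines.z_image import ZImagePipeline, ModelConfig  # assumed import path

pipe = ZImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="Tongyi-MAI/Z-Image-Turbo", origin_file_pattern="transformer/*.safetensors"),
        ModelConfig(model_id="Tongyi-MAI/Z-Image-Turbo", origin_file_pattern="text_encoder/*.safetensors"),
        ModelConfig(model_id="Tongyi-MAI/Z-Image-Turbo", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
    ],
)
# Your own SFT LoRA from Scheme 1 (illustrative path).
pipe.load_lora(pipe.dit, "models/train/Z-Image-Turbo_lora/epoch-4.safetensors")
# The DiffSynth-Studio/Z-Image-Turbo-DistillPatch LoRA that restores acceleration.
pipe.load_lora(pipe.dit, "models/Z-Image-Turbo-DistillPatch/lora.safetensors")

# With the patch applied, the acceleration configuration works again.
image = pipe(prompt="a dog", num_inference_steps=8, cfg_scale=1, seed=0)
image.save("scheme4_a_dog.png")
```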
Conclusion
Comparison
| Scheme | Training Approach | Inference Config | Advantages | Disadvantages | Best For |
|---|---|---|---|---|---|
| 1 | Standard SFT (full or LoRA) | `num_inference_steps=30`, `cfg_scale=2` (non-accelerated) | Simple and familiar workflow; high generation quality under slow inference | Loses Turbo acceleration; cannot generate quality images in 8 steps; slower inference negates the model's core advantage | Users prioritizing output quality over speed; no need for fast inference |
| 2 | Differential LoRA (with preset adapter, e.g., ostris/zimage_turbo_training_adapter) | `num_inference_steps=8`, `cfg_scale=1` (accelerated) | Preserves 8-step acceleration; low VRAM usage; good speed/customization balance | Requires a specific preset LoRA; less flexible for complex or novel concepts; potential domain misalignment | Lightweight customization where Turbo speed is essential |
| 3 | Two-stage: (1) Standard SFT → (2) Trajectory Imitation Distillation | `num_inference_steps=8`, `cfg_scale=1` (accelerated) | High-quality domain adaptation with restored acceleration; robust 8-step performance | Complex two-phase training; higher computational cost and time; requires careful distillation tuning | High-fidelity applications needing both customization and speed, with ample resources |
| 4 | Standard SFT only | `num_inference_steps=8`, `cfg_scale=1` + load external DistillPatch LoRA at inference | Uses simple, standard SFT; no retraining needed; instantly recovers Turbo speed via plug-in LoRA; works with existing SFT models | Requires loading an extra LoRA at inference (minimal overhead); depends on external module availability | Most users; ideal for flexibility, efficiency, and maintaining both quality and speed |
Recommendation: Use Scheme 4
Scheme 4 offers the best trade-off: you keep the simplicity and power of standard SFT while effortlessly restoring Turbo acceleration at inference time by loading the official Z-Image-Turbo-DistillPatch LoRA. This plug-and-play approach avoids retraining, supports existing models, and delivers high-quality 8-step generation, making it the most practical and scalable choice.