
WAM-Diff: A Masked Diffusion VLA Framework with MoE and Online Reinforcement Learning for Autonomous Driving

Mingwang Xu1*  Jiahao Cui1*  Feipeng Cai2*  Hanlin Shang1*  Zhihao Zhu1  Shan Luan1 
Yifang Xu1  Neng Zhang2  Yaoyi Li2  Jia Cai2  Siyu Zhu1 
1Fudan University  2Yinwang Intelligent Technology Co., Ltd 


πŸ“° News

  • 2026/02/01: 🎉🎉🎉 Released the pretrained models on Hugging Face.
  • 2025/12/06: 🎉🎉🎉 Paper submitted to arXiv.

πŸ“…οΈ Roadmap

| Status | Milestone                                  | ETA        |
| :----: | :----------------------------------------- | :--------- |
| ✅     | Release the inference source code           | 2025.12.21 |
| ✅     | Release the SFT and inference code          | 2025.12.21 |
| ✅     | Release pretrained models on Hugging Face   | 2026.02.01 |
| 🚀     | Release NAVSIM evaluation code              | TBD        |
| 🚀     | Release the RL code                         | TBD        |

πŸ”§οΈ Framework

(Figure: overview of the WAM-Diff framework.)

πŸ† Qualitative Results on NAVSIM

NAVSIM-v1 benchmark results

(Figure: qualitative results on the NAVSIM-v1 benchmark.)

NAVSIM-v2 benchmark results

(Figure: qualitative results on the NAVSIM-v2 benchmark.)

Quick Inference Demo

The pretrained WAM-Diff models are available on the Hugging Face Hub. To quickly test the model, follow these steps:

  1. Clone the repository

    git clone https://github.com/fudan-generative-vision/WAM-Diff
    cd WAM-Diff
    
  2. Initialize the environment
    If you prefer conda, run the environment setup script to install the necessary dependencies:

    bash init_env.sh
    

    Or you can use uv to create the environment:

    uv venv && uv sync
    
  3. Prepare the models
    Download the pretrained WAM-Diff model from Hugging Face into the ./model/WAM-Diff directory (a scripted download sketch is given after this list):

    https://huggingface.co/fudan-generative-ai/WAM-Diff
    

    Download the pretrained SigLIP 2 model from Hugging Face into the ./model/siglip2-so400m-patch14-384 directory:

    https://huggingface.co/google/siglip2-so400m-patch14-384
    
  4. Run the demo script
    Execute the demo script to test WAM-Diff on an example image:

    bash inf.sh
    
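For step 3, the following is a minimal download sketch using the huggingface_hub Python package (install with pip install huggingface_hub). The target directories mirror the paths expected by the scripts above; adjust them if your layout differs.

    # Minimal sketch: fetch both checkpoints into the directories used by the
    # inference scripts above. Requires `pip install huggingface_hub`.
    from huggingface_hub import snapshot_download

    # WAM-Diff checkpoint
    snapshot_download(
        repo_id="fudan-generative-ai/WAM-Diff",
        local_dir="./model/WAM-Diff",
    )

    # SigLIP 2 vision encoder
    snapshot_download(
        repo_id="google/siglip2-so400m-patch14-384",
        local_dir="./model/siglip2-so400m-patch14-384",
    )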

Training

To fine-tune WAM-Diff, please follow these steps:

  1. Set Up the Environment
    Follow the same environment setup steps as in the Quick Inference Demo section.
  2. Prepare the Data
    Prepare your training dataset as a JSON file in the following format (a scripted sketch for assembling such entries is given after this list):
    [
        {
            "image": ["path/to/image1.png"],
            "conversations": [
                {
                    "from": "human",
                    "value": "Here is front views of a driving vehicle:\n<image>\nThe navigation information is: straight\nThe current position is (0.00,0.00)\nCurrent velocity is: (13.48,-0.29)  and current accelerate is: (0.19,0.05)\nPredict the optimal driving action for the next 4 seconds with 8 new waypoints."
                },
                {
                    "from": "gpt",
                    "value": "6.60,-0.01,13.12,-0.03,19.58,-0.04,25.95,-0.03,32.27,-0.03,38.56,-0.05,44.88,-0.06,51.16,-0.09"
                }
            ]
        },
        ...
    ]
    
  3. Run the Training Script
    Execute the training script with the following commands:
    cd train
    bash ./scripts/llada_v_finetune.sh
    
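For step 2, here is a minimal, hypothetical Python sketch of assembling one sample in the JSON format above. The make_sample helper, the output file name train_data.json, and the numeric values are illustrative placeholders, not the authors' actual preprocessing pipeline:

    # Hypothetical sketch (not the official preprocessing): write a training JSON
    # file in the format shown above. The target is serialized as 8 (x, y)
    # waypoints, i.e. a flat "x1,y1,...,x8,y8" string covering the next 4 seconds.
    import json

    def make_sample(image_path, command, position, velocity, accel, waypoints):
        """Build one entry; all numbers are formatted to two decimals."""
        prompt = (
            "Here is front views of a driving vehicle:\n<image>\n"
            f"The navigation information is: {command}\n"
            f"The current position is ({position[0]:.2f},{position[1]:.2f})\n"
            f"Current velocity is: ({velocity[0]:.2f},{velocity[1]:.2f})  "
            f"and current accelerate is: ({accel[0]:.2f},{accel[1]:.2f})\n"
            "Predict the optimal driving action for the next 4 seconds with 8 new waypoints."
        )
        answer = ",".join(f"{v:.2f}" for xy in waypoints for v in xy)
        return {
            "image": [image_path],
            "conversations": [
                {"from": "human", "value": prompt},
                {"from": "gpt", "value": answer},
            ],
        }

    samples = [
        make_sample(
            image_path="path/to/image1.png",
            command="straight",
            position=(0.00, 0.00),
            velocity=(13.48, -0.29),
            accel=(0.19, 0.05),
            waypoints=[(6.60, -0.01), (13.12, -0.03), (19.58, -0.04), (25.95, -0.03),
                       (32.27, -0.03), (38.56, -0.05), (44.88, -0.06), (51.16, -0.09)],
        ),
    ]

    with open("train_data.json", "w") as f:
        json.dump(samples, f, indent=4)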

πŸ“ Citation

If you find our work useful for your research, please consider citing the paper:

@article{xu2025wam,
  title={WAM-Diff: A Masked Diffusion VLA Framework with MoE and Online Reinforcement Learning for Autonomous Driving},
  author={Xu, Mingwang and Cui, Jiahao and Cai, Feipeng and Shang, Hanlin and Zhu, Zhihao and Luan, Shan and Xu, Yifang and Zhang, Neng and Li, Yaoyi and Cai, Jia and others},
  journal={arXiv preprint arXiv:2512.11872},
  year={2025}
}

πŸ€— Acknowledgements

We gratefully acknowledge the contributors to the LLaDA-V repository, whose commitment to open source has provided us with an excellent codebase and pretrained models.
