PerceptionDLM-Base

PerceptionDLM-Base is an open multimodal diffusion language model (DLM) that extends the LLaDA-8B language diffusion backbone to visual instruction tuning. It sets a new state-of-the-art baseline among open discrete-diffusion VLMs, outperforming LLaDA-V on 15 of 16 standard multimodal benchmarks while remaining competitive with same-scale autoregressive (AR) VLMs.

It serves as the foundation model for PerceptionDLM, our parallel region-perception model.

📄 Paper  |  💻 Code  |  🤗 Model Collection

Highlights

  • 🧠 Diffusion-based VLM. Non-autoregressive masked-denoising generation with intrinsic token-level parallelism.
  • 🏗️ LLaVA-style architecture. SigLIP-2 vision encoder + 2-layer MLP connector + LLaDA-8B diffusion decoder, with dynamic-resolution tiling for high-resolution inputs.
  • 🏆 Strong baseline. Outperforms LLaDA-V on 15/16 benchmarks; especially strong on fine-grained perception and hallucination robustness.
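
The masked-denoising generation named in the first highlight can be illustrated with a toy sampler. This is a minimal sketch, not PerceptionDLM's actual sampler: `MASK`, `toy_denoiser`, and the even unmasking schedule are hypothetical stand-ins, and the "model" here just returns random guesses. It only shows the control flow: start fully masked, predict every position in parallel, and commit the most confident predictions at each step.

```python
import random

MASK = -1  # hypothetical mask-token id, used only in this toy example


def toy_denoiser(tokens, vocab_size, rng):
    """Stand-in for the diffusion decoder (assumption, not the real model).

    Returns one (token_id, confidence) guess per position, all in parallel.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def masked_denoise(length, steps, vocab_size=100, seed=0):
    """Fill `length` masked positions in at most `steps` parallel steps."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        preds = toy_denoiser(tokens, vocab_size, rng)
        # Spread the still-masked positions evenly over the remaining steps.
        remaining_steps = steps - step
        k = -(-len(masked) // remaining_steps)  # ceiling division
        # Commit the k most confident guesses; the rest stay masked.
        chosen = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:k]
        for i in chosen:
            tokens[i] = preds[i][0]
    return tokens


out = masked_denoise(length=16, steps=4)  # 4 tokens committed per step
print(out)
```

With `steps` equal to `length`, one token is committed per step; with fewer steps, several tokens are committed at once, which is where the token-level parallelism comes from.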

Model Details

Vision encoder: google/siglip2-so400m-patch16-512 (frozen)
Connector: 2-layer MLP with GELU
Language backbone: LLaDA-Instruct-8B (diffusion)
Parameters: ~8B
Training: 4-stage visual instruction tuning, 32× H100 (~3 weeks)
Precision: bfloat16
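
To give a feel for what dynamic-resolution tiling costs in sequence length, the sketch below counts visual tokens for a given image size. The tile size (512) and patch size (16) come from the encoder name above, so one tile yields 32 × 32 = 1024 patch tokens. Everything else is an assumption for illustration: the grid rule, the `max_tiles` cap, and the extra thumbnail tile are hypothetical, and the real pipeline may pool or compress tokens before the connector.

```python
import math

TILE = 512   # input resolution of siglip2-so400m-patch16-512
PATCH = 16   # patch size of the same encoder


def tokens_per_tile(tile=TILE, patch=PATCH):
    """Patch tokens produced by one tile, before any pooling."""
    side = tile // patch
    return side * side


def tile_grid(width, height, tile=TILE, max_tiles=9):
    """Hypothetical grid rule: cover the image, then shrink to the tile budget."""
    cols = max(1, math.ceil(width / tile))
    rows = max(1, math.ceil(height / tile))
    while cols * rows > max_tiles:
        # Drop a column or row from whichever side is currently larger.
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1
    return cols, rows


def visual_tokens(width, height):
    """Total patch tokens, assuming one extra global thumbnail for tiled images."""
    cols, rows = tile_grid(width, height)
    n_tiles = cols * rows
    if n_tiles > 1:
        n_tiles += 1  # assumed low-resolution overview tile
    return n_tiles * tokens_per_tile()


print(tokens_per_tile())         # 1024
print(visual_tokens(512, 512))   # 1024
print(visual_tokens(1024, 768))  # 5120
```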

Results

PerceptionDLM-Base vs. open diffusion / AR VLMs (selected benchmarks):

| Benchmark | PerceptionDLM-Base | LLaDA-V | Qwen2.5-VL-7B | InternVL3-8B |
|---|---|---|---|---|
| MMBench | 85.0 | 82.9 | 83.5 | 83.4 |
| SeedBench | 78.9 | 74.8 | 77.0 | 77.1 |
| ChartQA | 91.6 | 78.3 | 86.2 | 86.6 |
| MMVP | 82.0 | 76.7 | 73.3 | 80.0 |
| BLINK | 60.3 | 50.9 | 55.3 | 55.5 |
| RealWorldQA | 73.7 | 63.2 | 68.4 | 70.8 |
| HallusionBench | 58.4 | 50.9 | 51.9 | 49.9 |

See the paper for the full 16-benchmark comparison.
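
For a quick read of the selected results, the margins over LLaDA-V can be computed directly from the numbers reported in this section. The snippet is only arithmetic on those seven rows; it says nothing about the remaining benchmarks in the paper.

```python
# (PerceptionDLM-Base, LLaDA-V) scores copied from the results in this section.
scores = {
    "MMBench": (85.0, 82.9),
    "SeedBench": (78.9, 74.8),
    "ChartQA": (91.6, 78.3),
    "MMVP": (82.0, 76.7),
    "BLINK": (60.3, 50.9),
    "RealWorldQA": (73.7, 63.2),
    "HallusionBench": (58.4, 50.9),
}

# Per-benchmark gain of PerceptionDLM-Base over LLaDA-V, in points.
gains = {name: round(ours - base, 1) for name, (ours, base) in scores.items()}
mean_gain = sum(gains.values()) / len(gains)
largest = max(gains, key=gains.get)

for name, gain in gains.items():
    print(f"{name:<15} +{gain:.1f}")
print(f"mean gain: +{mean_gain:.2f} (largest: {largest})")
```

On these seven rows the gain is positive everywhere, largest on ChartQA (+13.3) and about +7.5 points on average.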

Usage

Full inference scripts are provided in the GitHub repository.

python demo/infer_dmllm.py \
  --model-path MSALab/PerceptionDLM-Base \
  --image assets/demo.jpg \
  --prompt "What color shirt is the man in the picture wearing?" \
  --gen-length 64 --block-length 64 --steps 64

Loading the model and processor in Python:

import torch
from transformers import AutoModel, AutoProcessor

model_path = "MSALab/PerceptionDLM-Base"
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, trust_remote_code=True
).cuda().eval()
# See demo/infer_dmllm.py for the full preprocessing + generation pipeline.
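
The three generation flags in the command above interact, and a small helper makes the trade-off explicit. This sketch assumes the flags follow the semantics of the LLaDA reference sampler (generation split into blocks of `block_length`, with `steps` divided evenly across blocks); `denoising_schedule` is a hypothetical helper, so check `demo/infer_dmllm.py` for the script's actual behavior.

```python
def denoising_schedule(gen_length, block_length, steps):
    """Return (num_blocks, steps_per_block, tokens_per_step).

    Assumes LLaDA-style semi-autoregressive block decoding; this is an
    illustration, not code from the PerceptionDLM repository.
    """
    if gen_length % block_length != 0:
        raise ValueError("gen_length must be a multiple of block_length")
    num_blocks = gen_length // block_length
    if steps % num_blocks != 0:
        raise ValueError("steps must be a multiple of the number of blocks")
    steps_per_block = steps // num_blocks
    tokens_per_step = block_length / steps_per_block
    return num_blocks, steps_per_block, tokens_per_step


# Settings from the command above: one block, one token committed per step.
print(denoising_schedule(64, 64, 64))  # (1, 64, 1.0)

# Halving the steps commits two tokens per step: faster, usually less precise.
print(denoising_schedule(64, 64, 32))  # (1, 32, 2.0)
```

Under this reading, lowering `--steps` relative to `--gen-length` is what buys parallel decoding speed, at some likely cost in output quality.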

Citation

@article{sun2026perceptiondlm,
  title   = {PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models},
  author  = {Sun, Yueyi and Wang, Yuhao and Li, Jason and Tian, Ye and Zhang, Tao and Mai, Jacky and Wang, Yihan and Wang, Haochen and Bai, Jinbin and Yang, Ling and Tong, Yunhai},
  journal = {arXiv preprint arXiv:2606.19534},
  year    = {2026}
}

License

Released under the Apache License 2.0.
