Qwen3-VL-235B-A22B-Instruct-NVFP4

An NVFP4 (4-bit floating point) quantized version of Qwen/Qwen3-VL-235B-A22B-Instruct, optimized for NVIDIA Blackwell GPUs.

Model Details

  • Base Model: Qwen/Qwen3-VL-235B-A22B-Instruct
  • Parameters: 235B total (22B active per token)
  • Architecture: Mixture of Experts (MoE) + vision encoder
  • Quantization: NVFP4 (weights & activations)
  • Original Size: ~471 GB (BF16)
  • Quantized Size: ~127 GB
  • Compression: ~3.7x

Quantization Details

  • Tool: llmcompressor (a reproduction sketch follows this list)
  • Scheme: NVFP4 (4-bit floating point with block-wise scaling)
  • Calibration: 512 samples at a max sequence length of 4096 tokens
  • Pipeline: Sequential (layer-by-layer for memory efficiency)
  • Preserved layers: embeddings, lm_head, vision encoder, MoE gates/routers
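
A run along these lines can be approximated with llmcompressor's oneshot API. This is a minimal sketch, assuming a recent llmcompressor release with NVFP4 support; the model class, calibration dataset, and exact ignore patterns are illustrative assumptions, not the exact recipe used for this checkpoint.

# Sketch of an NVFP4 oneshot run; argument names per recent llmcompressor.
from transformers import AutoProcessor, AutoModelForImageTextToText
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-VL-235B-A22B-Instruct"

model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

# NVFP4 on all Linear layers; keep sensitive layers in higher precision.
# The regexes below are assumptions about Qwen3-VL module names.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=[
        "lm_head",               # output head
        "re:.*embed_tokens.*",   # token embeddings
        "re:visual.*",           # vision encoder
        "re:.*mlp.gate$",        # MoE gates/routers
    ],
)

oneshot(
    model=model,
    processor=processor,
    recipe=recipe,
    dataset="ultrachat_200k",    # calibration set (assumption)
    num_calibration_samples=512,
    max_seq_length=4096,
    pipeline="sequential",       # layer-by-layer for memory efficiency
)

model.save_pretrained("Qwen3-VL-235B-A22B-Instruct-NVFP4", save_compressed=True)
processor.save_pretrained("Qwen3-VL-235B-A22B-Instruct-NVFP4")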

Hardware Requirements

  • Minimum VRAM: ~130 GB (TP=2 recommended)
  • Optimized for: NVIDIA Blackwell (SM120), e.g. RTX 5090, RTX PRO 6000
  • Also works on: NVIDIA Hopper (SM90), e.g. H100, H200 (see the capability check below)
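
Before downloading ~127 GB of weights, it is worth confirming that your GPUs report a supported compute capability (Blackwell reports 12.x, Hopper 9.0). A quick check with PyTorch:

# Print compute capability per GPU; this build targets SM90 or newer.
import torch

for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    supported = (major, minor) >= (9, 0)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} "
          f"(SM{major}{minor}) -> {'OK' if supported else 'unsupported'}")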

Usage with vLLM

# Environment variables for Blackwell GPUs
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8_CUTLASS=1
export VLLM_FLASH_ATTN_VERSION=2

# Serve with tensor parallelism
vllm serve GadflyII/Qwen3-VL-235B-A22B-Instruct-NVFP4 \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --max-model-len 32768
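
Once up, the server exposes an OpenAI-compatible API (port 8000 by default). A minimal sketch of a vision request using the openai client; the image URL is a placeholder:

# Send one multimodal chat request to the server started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="GadflyII/Qwen3-VL-235B-A22B-Instruct-NVFP4",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/photo.jpg"}},  # placeholder
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }],
    max_tokens=128,
)
print(response.choices[0].message.content)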

Performance

Tested on 2x NVIDIA RTX PRO 6000 Blackwell GPUs (192 GB VRAM total) with vLLM 0.13.0:

  • Single-request decode: 57 tokens/s
  • Batch throughput (8 concurrent requests): 277 tokens/s
  • Memory usage: ~128 GB (~64 GB per GPU)
  • Model load time: ~30 s
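
For a rough sanity check of single-request decode speed, a streaming request can be timed against the same endpoint. This is a sketch, not the benchmark used above; it counts stream chunks as a token proxy and includes prefill time in the measurement:

# Approximate decode rate by counting streamed chunks.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.time()
tokens = 0
stream = client.chat.completions.create(
    model="GadflyII/Qwen3-VL-235B-A22B-Instruct-NVFP4",
    messages=[{"role": "user", "content": "Write a 300-word story about a lighthouse."}],
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        tokens += 1
elapsed = time.time() - start
print(f"~{tokens / elapsed:.1f} tokens/s (chunk-count approximation)")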

License

Apache 2.0 (same as the base model)
