Qwen3-VL-235B-A22B-Instruct-NVFP4

An NVFP4 (4-bit floating point) quantized version of Qwen/Qwen3-VL-235B-A22B-Instruct, optimized for NVIDIA Blackwell GPUs.

Model Details

  • Base Model: Qwen/Qwen3-VL-235B-A22B-Instruct
  • Parameters: 235B total (22B active per token)
  • Architecture: Mixture of Experts (MoE) + vision encoder
  • Quantization: NVFP4 (weights & activations)
  • Original Size: ~471 GB (BF16)
  • Quantized Size: ~127 GB
  • Compression: ~3.7x

Quantization Details

  • Tool: llmcompressor (a reproduction sketch follows this list)
  • Scheme: NVFP4 (4-bit floating point with block-wise scaling)
  • Calibration: 512 samples at a max sequence length of 4096 tokens
  • Pipeline: Sequential (layer-by-layer for memory efficiency)
  • Preserved layers: embeddings, lm_head, vision encoder, MoE gates/routers
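
A run along these lines can be approximated with llmcompressor's oneshot API. This is a minimal sketch, assuming a recent llmcompressor release with NVFP4 support; the model class, calibration dataset, and exact ignore patterns are illustrative assumptions, not the exact recipe used for this checkpoint.

# Sketch of an NVFP4 oneshot run; argument names per recent llmcompressor.
from transformers import AutoProcessor, AutoModelForImageTextToText
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-VL-235B-A22B-Instruct"

model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

# NVFP4 on all Linear layers; keep sensitive layers in higher precision.
# The regexes below are assumptions about Qwen3-VL module names.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=[
        "lm_head",               # output head
        "re:.*embed_tokens.*",   # token embeddings
        "re:visual.*",           # vision encoder
        "re:.*mlp.gate$",        # MoE gates/routers
    ],
)

oneshot(
    model=model,
    processor=processor,
    recipe=recipe,
    dataset="ultrachat_200k",    # calibration set (assumption)
    num_calibration_samples=512,
    max_seq_length=4096,
    pipeline="sequential",       # layer-by-layer for memory efficiency
)

model.save_pretrained("Qwen3-VL-235B-A22B-Instruct-NVFP4", save_compressed=True)
processor.save_pretrained("Qwen3-VL-235B-A22B-Instruct-NVFP4")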

Hardware Requirements

  • Minimum VRAM: ~130 GB (TP=2 recommended)
  • Optimized for: NVIDIA Blackwell (SM120), e.g. RTX 5090, RTX PRO 6000
  • Also works on: NVIDIA Hopper (SM90), e.g. H100, H200 (see the capability check below)
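
Before downloading ~127 GB of weights, it is worth confirming that your GPUs report a supported compute capability (Blackwell reports 12.x, Hopper 9.0). A quick check with PyTorch:

# Print compute capability per GPU; this build targets SM90 or newer.
import torch

for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    supported = (major, minor) >= (9, 0)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} "
          f"(SM{major}{minor}) -> {'OK' if supported else 'unsupported'}")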

Usage with vLLM

# Environment variables for Blackwell GPUs
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8_CUTLASS=1
export VLLM_FLASH_ATTN_VERSION=2

# Serve with tensor parallelism
vllm serve GadflyII/Qwen3-VL-235B-A22B-Instruct-NVFP4 \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --max-model-len 32768
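
Once up, the server exposes an OpenAI-compatible API (port 8000 by default). A minimal sketch of a vision request using the openai client; the image URL is a placeholder:

# Send one multimodal chat request to the server started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="GadflyII/Qwen3-VL-235B-A22B-Instruct-NVFP4",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/photo.jpg"}},  # placeholder
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }],
    max_tokens=128,
)
print(response.choices[0].message.content)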

Performance

Tested on 2x NVIDIA RTX PRO 6000 Blackwell GPUs (192 GB VRAM total) with vLLM 0.13.0:

  • Single-request decode: 57 tokens/s
  • Batch throughput (8 concurrent requests): 277 tokens/s
  • Memory usage: ~128 GB (~64 GB per GPU)
  • Model load time: ~30 s
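
For a rough sanity check of single-request decode speed, a streaming request can be timed against the same endpoint. This is a sketch, not the benchmark used above; it counts stream chunks as a token proxy and includes prefill time in the measurement:

# Approximate decode rate by counting streamed chunks.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.time()
tokens = 0
stream = client.chat.completions.create(
    model="GadflyII/Qwen3-VL-235B-A22B-Instruct-NVFP4",
    messages=[{"role": "user", "content": "Write a 300-word story about a lighthouse."}],
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        tokens += 1
elapsed = time.time() - start
print(f"~{tokens / elapsed:.1f} tokens/s (chunk-count approximation)")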

License

Apache 2.0 (same as the base model)
