# Qwen3-VL-235B-A22B-Instruct-NVFP4
NVFP4 (4-bit floating point) quantized version of Qwen/Qwen3-VL-235B-A22B-Instruct, optimized for NVIDIA Blackwell GPUs.
## Model Details
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3-VL-235B-A22B-Instruct |
| Parameters | 235B (22B active per token) |
| Architecture | MoE (Mixture of Experts) + Vision |
| Quantization | NVFP4 (weights & activations) |
| Original Size | ~471 GB (BF16) |
| Quantized Size | ~127 GB |
| Compression | ~3.7x |
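As a rough sanity check on the sizes above: NVFP4 stores 4-bit values plus one FP8 scale per 16-value block, i.e. about 4.5 bits per quantized parameter. A back-of-envelope estimate (ignoring which layers stay in higher precision, and GB vs. GiB conventions) lands close to the reported numbers:

```python
params = 235e9  # total parameter count

# BF16 baseline: 2 bytes per parameter
bf16_gb = params * 2 / 1e9  # ~470 GB, matching the ~471 GB above

# NVFP4: 4 bits per value plus one 8-bit scale per 16-value block
bits_per_param = 4 + 8 / 16  # 4.5 bits
nvfp4_gb = params * bits_per_param / 8 / 1e9  # ~132 GB, in the ballpark of ~127 GB

print(round(bf16_gb), round(nvfp4_gb))
```

The residual gap comes from the layers kept unquantized and from how checkpoint sizes are reported.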
## Quantization Details
- Tool: llmcompressor
- Scheme: NVFP4 (4-bit floating point with block-wise scaling)
- Calibration: 512 samples, 4096 sequence length
- Pipeline: Sequential (layer-by-layer for memory efficiency)
- Preserved layers: embeddings, lm_head, vision encoder, MoE gates/routers
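A sketch of what such a recipe can look like with llmcompressor. This is not the exact recipe used for this checkpoint; the `ignore` patterns, dataset argument, and layer names are illustrative placeholders:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# NVFP4 on all Linear layers, skipping the components preserved above.
# The regexes below are illustrative, not the exact ones used here.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["lm_head", "re:.*embed.*", "re:visual.*", "re:.*mlp\\.gate$"],
)

oneshot(
    model="Qwen/Qwen3-VL-235B-A22B-Instruct",
    recipe=recipe,
    dataset=...,                    # 512 calibration samples
    num_calibration_samples=512,
    max_seq_length=4096,
    pipeline="sequential",          # layer-by-layer to bound memory use
)
```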
## Hardware Requirements
- Minimum VRAM: ~130 GB total (tensor parallelism across 2 GPUs recommended)
- Optimized for: NVIDIA Blackwell (SM120) - RTX 5090, RTX PRO 6000
- Also works on: NVIDIA Hopper (SM90) - H100, H200
## Usage with vLLM
```bash
# Environment variables for Blackwell GPUs
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8_CUTLASS=1
export VLLM_FLASH_ATTN_VERSION=2

# Serve with tensor parallelism
vllm serve GadflyII/Qwen3-VL-235B-A22B-Instruct-NVFP4 \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --max-model-len 32768
```
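Once the server is up, it exposes an OpenAI-compatible API. A minimal sketch of a multimodal chat request payload (the image URL is a placeholder, and the default port 8000 is assumed):

```python
import json
import urllib.request

payload = {
    "model": "GadflyII/Qwen3-VL-235B-A22B-Instruct-NVFP4",
    "messages": [
        {
            "role": "user",
            "content": [
                # Image URL below is a placeholder for your own image
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
    "max_tokens": 256,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# response = urllib.request.urlopen(req)  # uncomment with a running server
```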
## Performance
Tested on 2x NVIDIA RTX PRO 6000 Blackwell GPUs (192 GB VRAM total) with vLLM 0.13.0:
| Metric | Value |
|---|---|
| Single Request Decode | 57 tokens/s |
| Batch Throughput (8 req) | 277 tokens/s |
| Memory Usage | ~128 GB (64 GB/GPU) |
| Model Load Time | ~30s |
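Read together, the batch numbers imply roughly a 4.9x aggregate speedup over single-request decoding, at about 35 tokens/s per concurrent request. A quick check:

```python
single_stream = 57.0  # tokens/s, one request
batch_total = 277.0   # tokens/s aggregate, 8 concurrent requests

per_request = batch_total / 8                     # ~34.6 tokens/s per request
aggregate_speedup = batch_total / single_stream   # ~4.9x

print(round(per_request, 1), round(aggregate_speedup, 1))
```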
## Acknowledgments
- Original model by the Qwen team
- Quantized with llmcompressor (from the vLLM project)
## License
Apache 2.0 (same as base model)