# MiniMax-M2.1-NVFP4
NVFP4 quantized version of MiniMaxAI/MiniMax-M2.1 for efficient inference on NVIDIA Blackwell GPUs.
## Model Details
| Property | Value |
|---|---|
| Base Model | MiniMaxAI/MiniMax-M2.1 |
| Architecture | Mixture of Experts (MoE) |
| Total Parameters | 229B |
| Active Parameters | ~45B (8 of 256 experts) |
| Quantization | NVFP4 (e2m1 format) |
| Size | 131 GB |
## Quantization Details

- Format: NVFP4 with two-level scaling (block-wise FP8 scales plus a global FP32 scale); see the sketch after this list
- Scheme: `compressed-tensors` with the `nvfp4-pack-quantized` format
- Target: all linear layers in attention and MoE experts
- Group Size: 16
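To make the two-level scaling concrete, here is a minimal NumPy sketch of how one 16-weight group maps onto e2m1 values. The function name and the nearest-value rounding are illustrative assumptions, not the exact compressed-tensors kernel; in the real format the block scale is stored in FP8 (e4m3) and combined with a per-tensor FP32 scale.

```python
import numpy as np

# Magnitudes representable in FP4 e2m1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block_nvfp4(block):
    """Toy NVFP4 quantization of one 16-value group (illustration only)."""
    assert block.size == 16
    # The block scale maps the largest magnitude onto the e2m1 maximum (6.0).
    # In the real format this scale is itself stored as FP8 (e4m3); we keep
    # plain floats here for clarity.
    scale = np.abs(block).max() / 6.0
    scaled = block / scale
    # Round each value to the nearest representable e2m1 magnitude.
    nearest = np.abs(np.abs(scaled)[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
    return np.sign(scaled) * E2M1_GRID[nearest], scale

rng = np.random.default_rng(0)
block = rng.normal(size=16).astype(np.float32)
codes, scale = quantize_block_nvfp4(block)
print("max abs error:", np.abs(block - codes * scale).max())
```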
## Requirements
- NVIDIA Blackwell GPU (RTX 5090, RTX PRO 6000, etc.)
- vLLM with flashinfer-cutlass NVFP4 support
- ~130 GB of VRAM (TP=2 recommended for dual-GPU setups); a rough size check follows this list
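As a sanity check on those figures, here is a hypothetical back-of-the-envelope calculation, assuming 4 bits per weight plus one 1-byte FP8 scale per 16-weight group; unquantized layers (embeddings, norms) and the global scales account for the small remainder up to 131 GB.

```python
# Rough NVFP4 checkpoint size estimate (assumptions noted above).
total_params = 229e9                     # total parameters (MoE, all experts)
weights_gb = total_params * 0.5 / 1e9    # 4 bits per weight = 0.5 bytes
scales_gb = total_params / 16 / 1e9      # one 1-byte FP8 scale per group of 16
print(f"weights ~= {weights_gb:.0f} GB, scales ~= {scales_gb:.0f} GB, "
      f"total ~= {weights_gb + scales_gb:.0f} GB")  # ~129 GB, close to 131 GB
```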
## Usage with vLLM
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="GadflyII/MiniMax-M2.1-NVFP4",
    tensor_parallel_size=2,
    max_model_len=4096,
    gpu_memory_utilization=0.90,
    trust_remote_code=True,
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=1024,
)

outputs = llm.generate(["Your prompt here"], sampling_params)
print(outputs[0].outputs[0].text)
```
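vLLM can also serve the model behind its OpenAI-compatible API (e.g. `vllm serve GadflyII/MiniMax-M2.1-NVFP4 --tensor-parallel-size 2 --trust-remote-code`). Below is a minimal client sketch using the official `openai` package; the port and prompt are placeholders.

```python
from openai import OpenAI

# Assumes a vLLM OpenAI-compatible server running on the default port 8000.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="GadflyII/MiniMax-M2.1-NVFP4",
    messages=[{"role": "user", "content": "Your prompt here"}],
    temperature=0.7,
    top_p=0.9,
    max_tokens=1024,
)
print(response.choices[0].message.content)
```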
## Performance

Tested on 2× RTX PRO 6000 Blackwell GPUs (98 GB each):
| Prompt Tokens | Output Tokens | Throughput |
|---|---|---|
| ~100 | 100 | ~73 tok/s |
| ~1260 | 1000 | ~72 tok/s |
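Numbers like these can be reproduced with a simple wall-clock measurement. A minimal sketch follows; the prompt, sampling settings, and `ignore_eos` flag are illustrative choices, not the exact benchmark used above.

```python
import time

from vllm import LLM, SamplingParams

llm = LLM(model="GadflyII/MiniMax-M2.1-NVFP4", tensor_parallel_size=2,
          max_model_len=4096, trust_remote_code=True)
# ignore_eos forces a fixed-length generation for a stable tok/s reading.
params = SamplingParams(temperature=0.7, max_tokens=1000, ignore_eos=True)

start = time.perf_counter()
outputs = llm.generate(["Your prompt here"], params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} tok/s")
```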
## License

Same as the base model; see MiniMaxAI/MiniMax-M2.1 for details.
## Acknowledgments