# MiniMax-M2.1-NVFP4

NVFP4 quantized version of MiniMaxAI/MiniMax-M2.1 for efficient inference on NVIDIA Blackwell GPUs.

## Model Details

| Property | Value |
|---|---|
| Base Model | MiniMaxAI/MiniMax-M2.1 |
| Architecture | Mixture of Experts (MoE) |
| Total Parameters | 229B |
| Active Parameters | ~45B (8 of 256 experts) |
| Quantization | NVFP4 (E2M1) |
| Size | 131 GB |

## Quantization Details

- Format: NVFP4 with two-level scaling (block-wise FP8 scales plus a global FP32 scale); see the sketch below
- Scheme: compressed-tensors with the nvfp4-pack-quantized format
- Target: all linear layers in attention and MoE experts
- Group Size: 16
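For intuition, here is a rough sketch of the dequantization arithmetic this layout implies: each 4-bit E2M1 code is rescaled by its 16-element group's FP8 scale and by the tensor-wide FP32 scale. The function and variable names below are illustrative only; the actual inference path runs fused kernels and never materializes full-precision weights like this.

```python
import numpy as np

# Magnitudes representable by a 4-bit E2M1 value (sign handled separately).
E2M1_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def dequantize_nvfp4_group(codes, block_scale, global_scale):
    """Dequantize one group of 16 FP4 codes (illustrative helper).

    codes:        integers in [0, 15]; bit 3 is the sign, bits 0-2 index E2M1_VALUES.
    block_scale:  this group's scale (stored as FP8 E4M3 on disk, a plain float here).
    global_scale: the tensor-wide FP32 scale.
    """
    codes = np.asarray(codes)
    signs = np.where(codes & 0x8, -1.0, 1.0)
    magnitudes = E2M1_VALUES[codes & 0x7]
    return signs * magnitudes * block_scale * global_scale

# Example: codes 1, 9, 7, 15 decode to +0.5, -0.5, +6.0, -6.0 before scaling.
codes = np.array([1, 9, 7, 15] + [0] * 12)
print(dequantize_nvfp4_group(codes, block_scale=0.25, global_scale=2.0))
```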

## Requirements

- NVIDIA Blackwell GPU (RTX 5090, RTX PRO 6000, etc.)
- vLLM with flashinfer-cutlass NVFP4 support
- ~130 GB VRAM (TP=2 recommended for dual-GPU setups); a quick preflight check is sketched below
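Before loading the model, a rough preflight check like the one below can confirm that the visible GPUs are Blackwell-class and that combined VRAM is in the right range. This is an illustrative helper, not part of vLLM, and the compute-capability threshold is an assumption based on Blackwell reporting 10.x (datacenter) or 12.x (RTX) parts.

```python
import torch

def check_nvfp4_ready(min_total_vram_gb: float = 130.0) -> None:
    """Rough sanity check before loading the model (illustrative helper, not part of vLLM)."""
    if not torch.cuda.is_available():
        raise RuntimeError("No CUDA device visible")
    total_gb = 0.0
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        major, minor = torch.cuda.get_device_capability(i)
        vram_gb = props.total_memory / 1e9
        total_gb += vram_gb
        print(f"GPU {i}: {props.name}, sm_{major}{minor}, {vram_gb:.0f} GB")
        # Blackwell parts report compute capability 10.x (datacenter) or 12.x (RTX).
        if major < 10:
            print("  warning: NVFP4 tensor-core kernels expect a Blackwell-class GPU")
    if total_gb < min_total_vram_gb:
        print(f"warning: ~{min_total_vram_gb:.0f} GB total VRAM recommended, found {total_gb:.0f} GB")

check_nvfp4_ready()
```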

## Usage with vLLM

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="GadflyII/MiniMax-M2.1-NVFP4",
    tensor_parallel_size=2,
    max_model_len=4096,
    gpu_memory_utilization=0.90,
    trust_remote_code=True,
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=1024,
)

outputs = llm.generate(["Your prompt here"], sampling_params)
print(outputs[0].outputs[0].text)
```
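For chat-style prompting, recent vLLM versions also expose `LLM.chat()`, which applies the model's chat template for you. The snippet below assumes the `llm` and `sampling_params` objects from the example above and a vLLM version new enough to have this API.

```python
# Assumes the `llm` and `sampling_params` objects defined above.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain NVFP4 quantization in two sentences."},
]

chat_outputs = llm.chat(messages, sampling_params)
print(chat_outputs[0].outputs[0].text)
```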

## Performance

Tested on 2x RTX PRO 6000 Blackwell (98 GB each):

| Prompt Tokens | Output Tokens | Throughput |
|---|---|---|
| ~100 | 100 | ~73 tok/s |
| ~1260 | 1000 | ~72 tok/s |
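Numbers like these can be reproduced with a simple timing loop around `llm.generate()`. The sketch below reuses the objects from the usage example; the exact figure will vary with batch size, context length, and `max_model_len`.

```python
import time

prompts = ["Write a short overview of mixture-of-experts models."]
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=1000)

start = time.perf_counter()
outs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outs)
print(f"{generated} tokens in {elapsed:.1f} s -> {generated / elapsed:.1f} tok/s")
```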

## License

Same as the base model; see MiniMaxAI/MiniMax-M2.1 for details.

## Acknowledgments

- MiniMax for the original MiniMax-M2.1 model
- The vLLM team for NVFP4 quantization support