# Darwin-9B-MFP4

Mixed-Precision FP4 quantization of Darwin-9B-Opus, built on NVIDIA Blackwell-native NVFP4.

The first member of the Darwin Mixed-Precision family — quantization that respects what each layer actually does, instead of compressing everything uniformly.


## 🧬 Model Lineage

```
Qwen3.5 (Alibaba Qwen team — hybrid attention architecture)
    │
    ▼
Darwin-9B-Opus  (FINAL-Bench)
    │   evolutionary merge across the Qwen3.5 family
    │   Architecture: Qwen3_5ForConditionalGeneration
    │
    ▼
Darwin-9B-MFP4  ← this model
    Mixed-Precision FP4 via NVIDIA ModelOpt
    Targets MLP layers only; preserves attention pathways
```

## 💡 What is MFP4?

MFP4 (Mixed FP4) is a precision-allocation strategy, not a single bit-width. Different functional regions of the network get different precisions, chosen to match their role:

| Region | Precision | Why |
|---|---|---|
| MLP / FFN layers | NVFP4 (4-bit) | Per-token compute; tolerant to controlled noise |
| Self-attention (Q/K/V/O) | BF16 | Long-range coordination; sensitive to rounding |
| Linear-attention blocks | BF16 | Stateful recurrent paths; must remain stable |
| LM head / embeddings | BF16 | Direct I/O surface; no degradation acceptable |
| LayerNorms / scales | BF16 | Tiny, but critical scale factors |

The bulk of parameters (the MLPs) move to FP4, while the small but architecturally critical attention/coordination paths stay full-precision.
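The allocation policy above can be sketched as a simple name-pattern classifier. This is an illustration only, not ModelOpt's actual API; the module paths are hypothetical Qwen-style names.

```python
import re

# MFP4 allocation policy as a name-pattern classifier (illustrative;
# the real quantization is driven by NVIDIA ModelOpt, and these
# module paths are hypothetical Qwen-style names).
FP4_PATTERNS = [r"\.mlp\.(gate_proj|up_proj|down_proj)$"]
BF16_PATTERNS = [
    r"\.self_attn\.(q_proj|k_proj|v_proj|o_proj)$",  # attention pathways
    r"linear_attn",                                  # linear-attention blocks
    r"(embed_tokens|lm_head)$",                      # direct I/O surface
    r"norm",                                         # LayerNorms / scale factors
]

def precision_for(module_name: str) -> str:
    """Return the target precision for a module under the MFP4 policy."""
    if any(re.search(p, module_name) for p in BF16_PATTERNS):
        return "BF16"
    if any(re.search(p, module_name) for p in FP4_PATTERNS):
        return "NVFP4"
    return "BF16"  # anything unmatched stays at full precision

print(precision_for("model.layers.0.mlp.up_proj"))       # NVFP4
print(precision_for("model.layers.0.self_attn.q_proj"))  # BF16
```

Note the ordering: BF16 patterns win, so a layer is only quantized when it is positively identified as MLP compute.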


## 🎯 Why mixed precision matters

Uniform quantization treats every weight the same. In practice, transformer layers have very different roles:

- MLPs are local, parallel, and compute-heavy — they account for the majority of the parameter count and tolerate compression noise gracefully because each forward pass averages over many independent activations.
- Attention is the model's coordination substrate. Even small perturbations there propagate across long contexts, fragmenting reasoning chains and causing decoding pathologies (looping, repetition, premature termination).

A uniform 4-bit quantization compresses all of these the same way and pays an attention-quality cost it didn't need to pay. MFP4 isolates the cost to the layers that can absorb it.
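A back-of-envelope calculation shows why quantizing only the MLPs still captures most of the memory savings. The two-thirds MLP fraction below is an illustrative round number for a dense 9B-class transformer, not the exact Darwin-9B breakdown.

```python
# Rough memory math for a 9B-class dense transformer (illustrative
# numbers; not the exact Darwin-9B parameter breakdown).
total_params = 9e9
mlp_fraction = 2 / 3          # assumed share of parameters in MLPs

mlp = total_params * mlp_fraction
rest = total_params - mlp

bf16_bytes = total_params * 2  # BF16: 2 bytes per parameter
# NVFP4: 4-bit weights plus one FP8 scale per 16-element block,
# i.e. roughly 4.5 bits per parameter.
mfp4_bytes = mlp * 4.5 / 8 + rest * 2

print(f"BF16: {bf16_bytes / 1e9:.1f} GB")  # 18.0 GB
print(f"MFP4: {mfp4_bytes / 1e9:.1f} GB")  # 9.4 GB
```

Even with attention left untouched at BF16, the footprint roughly halves, which is consistent with the ≈ 11 GB vs 19 GB figures in the specs below once real per-layer shapes and metadata are accounted for.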

This aligns with the Darwin philosophy: let the architecture's structure dictate the optimization, rather than imposing a single recipe everywhere.


## 🚀 Why NVFP4 (and not just FP4)?

NVFP4 is NVIDIA's microblock 4-bit floating-point format with FP8-scaled groups of 16 elements.

- Native hardware acceleration on Blackwell (B200, RTX 5090): NVFP4 GEMMs run on dedicated tensor cores at 2nd-generation FP4 throughput, with no software emulation in the hot path.
- Higher numerical accuracy than INT4 at the same bit-width, thanks to the floating-point representation and per-block FP8 scales.
- First-class support in vLLM (`--quantization modelopt_fp4`), TensorRT-LLM, and the broader NVIDIA inference stack.

Combined with MFP4's selective application, the result is FP4-class memory savings on the bulk of the model with BF16-quality attention behavior — and on Blackwell, FP4-class throughput on the dominant cost center (MLP GEMMs).
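The microblock scheme described above — one scale per group of 16 elements, values restricted to the FP4 (E2M1) grid — can be sketched in a few lines. This is a toy model for intuition: real NVFP4 stores the scale in FP8 (E4M3) and packs two 4-bit codes per byte, neither of which is modeled here.

```python
# Toy sketch of NVFP4-style blockwise quantization: each group of 16
# values shares one scale (kept as a plain float here; FP8 E4M3 on
# hardware) and each value is rounded to the nearest E2M1 magnitude.
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # representable FP4 magnitudes

def quantize_block(block):
    """Quantize one 16-element block to (scale, signed FP4 codes)."""
    amax = max(abs(x) for x in block) or 1.0
    scale = amax / 6.0  # map the block's max magnitude onto E2M1's max (6)
    codes = []
    for x in block:
        mag = min(E2M1, key=lambda m: abs(abs(x) / scale - m))
        codes.append(mag if x >= 0 else -mag)
    return scale, codes

def dequantize_block(scale, codes):
    return [scale * c for c in codes]

block = [0.1 * i for i in range(16)]
scale, codes = quantize_block(block)
approx = dequantize_block(scale, codes)
```

The per-block scale is what gives NVFP4 its edge over plain INT4: outliers in one block of 16 cannot blow up the dynamic range of the rest of the tensor.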


## 📦 Specs

| Spec | Value |
|---|---|
| Base model | FINAL-Bench/Darwin-9B-Opus |
| Quantization scheme | MFP4 (MLP → NVFP4, attention → BF16) |
| Disk size | ≈ 11 GB (base BF16: 19 GB) |
| Architecture | Qwen3.5 hybrid (full + linear attention) |
| Quantization tool | NVIDIA ModelOpt |
| Inference runtime | vLLM ≥ 0.19 with `modelopt_fp4` backend |

## ⚙️ Where MFP4 fits in the Darwin platform

The Darwin platform produces base models through evolutionary merging of open-source families. MFP4 is the first deployment-ready quantization in that lineage — designed so that the structural decisions made during evolution (which attention type lives where, which MLP carries which capability) are preserved when the model is compressed for serving.

In other words: Darwin's value isn't only in how the weights got there — it's also in making sure those weights still work when you halve the memory footprint. MFP4 is the bridge between research-grade BF16 checkpoints and Blackwell-grade serving infrastructure.

Future Darwin releases will share this serving stack: same NVFP4 format, same MLP-only allocation policy, same vLLM path.


## 🚀 Usage

### vLLM (recommended)

```bash
pip install "vllm>=0.19" nvidia-modelopt

vllm serve FINAL-Bench/Darwin-9B-MFP4 \
    --quantization modelopt_fp4 \
    --trust-remote-code \
    --port 8000 \
    --enforce-eager \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.85
```

### OpenAI-compatible client

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="FINAL-Bench/Darwin-9B-MFP4",
    messages=[{"role": "user", "content": "..."}],
    max_tokens=4096,
    temperature=0.0,
)
```

## 🖥️ Hardware

| GPU family | Status |
|---|---|
| Blackwell (B200, RTX 5090) | ✅ Native NVFP4 tensor cores; best path |
| Hopper (H100/H200) | ✅ FLASHINFER_CUTLASS NVFP4 path |
| Ada (L40, RTX 6000 Ada) | ⚠️ Partial; depends on driver/runtime |
| Older Ampere/Volta | ❌ NVFP4 unavailable |

Minimum VRAM for inference: ~13 GB. Comfortable on a single 24 GB consumer card.


## 📍 When to use this model

Good fit:

- Latency- and memory-constrained serving on Blackwell or Hopper hardware
- Reasoning workloads where attention quality matters (multi-step deduction, long contexts)
- Workloads currently bottlenecked by 9B-class BF16 memory footprints

Consider the BF16 base instead if:

- You need bit-exact reproducibility against research baselines
- Your hardware lacks NVFP4 support
- You are doing further training / fine-tuning (quantize after, not before)


## 📜 License

Apache 2.0 (inherited from base model).
