Trinity Mini NVFP4

This repository contains the NVFP4 quantized weights of Trinity-Mini for deployment on NVIDIA Blackwell GPUs.

Trinity Mini is an Arcee AI 26B MoE model with 3B active parameters. It is the medium-sized model in our new Trinity family, a series of open-weight models for enterprise and tinkerers alike.

This model is tuned for reasoning, but in testing, it uses a similar total token count to competitive instruction-tuned models.

Trinity Mini is trained on 10T tokens gathered and curated through a key partnership with Datology, building upon the excellent dataset we used on AFM-4.5B with additional math and code.

Training was performed on a cluster of 512 H200 GPUs powered by Prime Intellect using HSDP parallelism.

More details, including key architecture decisions, can be found on our blog here

Model Details

Model Architecture: AfmoeForCausalLM
Parameters: 26B, 3B active
Experts: 128 total, 8 active, 1 shared
Context length: 128k
Training Tokens: 10T
License: Apache 2.0
Recommended settings:
- temperature: 0.15
- top_k: 50
- top_p: 0.75
- min_p: 0.06

Benchmarks

Quantization Details

Scheme: NVFP4 (nvfp4_mlp_only — MLP/expert weights only, attention remains BF16)
Tool: NVIDIA ModelOpt
Calibration: 512 samples, seq_length=2048, all-expert calibration enabled
KV cache: Not quantized

Running with vLLM

Requires vLLM >= 0.18.0. Native FP4 compute requires Blackwell GPUs; older GPUs fall back to Marlin weight decompression automatically.

Blackwell GPUs (B200/B300/GB300) — Docker (recommended)

docker run --runtime nvidia --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:v0.18.0-cu130 \
  arcee-ai/Trinity-Mini-NVFP4 \
  --trust-remote-code \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192

Hopper GPUs (H100/H200) and others

vllm serve arcee-ai/Trinity-Mini-NVFP4 \
  --trust-remote-code \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --host 0.0.0.0 \
  --port 8000

Note (Blackwell pip installs): If installing vLLM via pip on Blackwell rather than using Docker, native FP4 kernels may produce incorrect output due to package version mismatches. As a workaround, force the Marlin backend:

export VLLM_NVFP4_GEMM_BACKEND=marlin

vllm serve arcee-ai/Trinity-Mini-NVFP4 \
  --trust-remote-code \
  --moe-backend marlin \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --host 0.0.0.0 \
  --port 8000

Marlin decompresses FP4 weights to BF16 for compute, providing the full memory compression benefit (~3.7× vs BF16) but not native FP4 compute speedup. On Hopper GPUs (H100/H200), Marlin is selected automatically and no extra flags are needed.