Arcee Trinity Mini

Trinity Mini NVFP4

This repository contains the NVFP4 quantized weights of Trinity-Mini for deployment on NVIDIA Blackwell GPUs.

Trinity Mini is an Arcee AI 26B MoE model with 3B active parameters. It is the medium-sized model in our new Trinity family, a series of open-weight models for enterprise and tinkerers alike.

This model is tuned for reasoning, but in testing, it uses a similar total token count to competitive instruction-tuned models.


Trinity Mini is trained on 10T tokens gathered and curated through a key partnership with Datology, building upon the excellent dataset we used on AFM-4.5B with additional math and code.

Training was performed on a cluster of 512 H200 GPUs powered by Prime Intellect using HSDP parallelism.

More details, including key architecture decisions, can be found on our blog here


Model Details

  • Model Architecture: AfmoeForCausalLM
  • Parameters: 26B, 3B active
  • Experts: 128 total, 8 active, 1 shared
  • Context length: 128k
  • Training Tokens: 10T
  • License: Apache 2.0
  • Recommended settings:
    • temperature: 0.15
    • top_k: 50
    • top_p: 0.75
    • min_p: 0.06

Benchmarks

Powered by Datology

Quantization Details

  • Scheme: NVFP4 (nvfp4_mlp_only — MLP/expert weights only, attention remains BF16)
  • Tool: NVIDIA ModelOpt
  • Calibration: 512 samples, seq_length=2048, all-expert calibration enabled
  • KV cache: Not quantized

Running with vLLM

Requires vLLM >= 0.18.0. Native FP4 compute requires Blackwell GPUs; older GPUs fall back to Marlin weight decompression automatically.

Blackwell GPUs (B200/B300/GB300) — Docker (recommended)

docker run --runtime nvidia --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:v0.18.0-cu130 \
  arcee-ai/Trinity-Mini-NVFP4 \
  --trust-remote-code \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192

Hopper GPUs (H100/H200) and others

vllm serve arcee-ai/Trinity-Mini-NVFP4 \
  --trust-remote-code \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --host 0.0.0.0 \
  --port 8000

Note (Blackwell pip installs): If installing vLLM via pip on Blackwell rather than using Docker, native FP4 kernels may produce incorrect output due to package version mismatches. As a workaround, force the Marlin backend:

export VLLM_NVFP4_GEMM_BACKEND=marlin

vllm serve arcee-ai/Trinity-Mini-NVFP4 \
  --trust-remote-code \
  --moe-backend marlin \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --host 0.0.0.0 \
  --port 8000

Marlin decompresses FP4 weights to BF16 for compute, providing the full memory compression benefit (~3.7× vs BF16) but not native FP4 compute speedup. On Hopper GPUs (H100/H200), Marlin is selected automatically and no extra flags are needed.

License

Trinity-Mini-NVFP4 is released under the Apache-2.0 license.

Downloads last month
116
Safetensors
Model size
14B params
Tensor type
F32
·
BF16
·
F8_E4M3
·
U8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for arcee-ai/Trinity-Mini-NVFP4

Quantized
(18)
this model

Collection including arcee-ai/Trinity-Mini-NVFP4