Libraries

How to use arcee-ai/Trinity-Mini-FP8-Block with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="arcee-ai/Trinity-Mini-FP8-Block", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("arcee-ai/Trinity-Mini-FP8-Block", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("arcee-ai/Trinity-Mini-FP8-Block", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use arcee-ai/Trinity-Mini-FP8-Block with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "arcee-ai/Trinity-Mini-FP8-Block"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "arcee-ai/Trinity-Mini-FP8-Block",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/arcee-ai/Trinity-Mini-FP8-Block

SGLang

How to use arcee-ai/Trinity-Mini-FP8-Block with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "arcee-ai/Trinity-Mini-FP8-Block" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "arcee-ai/Trinity-Mini-FP8-Block",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "arcee-ai/Trinity-Mini-FP8-Block" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "arcee-ai/Trinity-Mini-FP8-Block",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use arcee-ai/Trinity-Mini-FP8-Block with Docker Model Runner:
```
docker model run hf.co/arcee-ai/Trinity-Mini-FP8-Block
```

Trinity Mini FP8-Block

This repository contains the FP8 block-quantized weights of Trinity-Mini (FP8 weights and activations with per-block scaling).

Trinity Mini is an Arcee AI 26B MoE model with 3B active parameters. It is the medium-sized model in our new Trinity family, a series of open-weight models for enterprise and tinkerers alike.

This model is tuned for reasoning, but in our testing its total token usage is similar to that of competing instruction-tuned models.

Trinity Mini is trained on 10T tokens gathered and curated through a key partnership with Datology, building upon the excellent dataset we used on AFM-4.5B with additional math and code.

Training was performed on a cluster of 512 H200 GPUs powered by Prime Intellect using HSDP parallelism.

More details, including key architecture decisions, can be found on our blog.

Try it out now at chat.arcee.ai

Model Details

Model Architecture: AfmoeForCausalLM
Parameters: 26B, 3B active
Experts: 128 total, 8 active, 1 shared
Context length: 128k
Training Tokens: 10T
License: OpenMDW-1.1
Recommended settings:
- temperature: 0.15
- top_k: 50
- top_p: 0.75
- min_p: 0.06

Quantization Details

Scheme: FP8 Block (FP8 weights and activations, per-block scaling with E8M0 scale format)
Format: compressed-tensors
Intended use: High-throughput FP8 deployment of Trinity-Mini with near-lossless quality, optimized for NVIDIA Hopper GPUs
Supported backends: DeepGEMM, vLLM CUTLASS, Triton
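
To make the "per-block scaling with E8M0 scale format" idea concrete, here is a minimal pure-Python sketch. E8M0 has 8 exponent bits and no mantissa, so each scale is a power of two. The sketch assumes the FP8 values use the E4M3 format (largest finite value 448) and uses a tiny hypothetical block size for readability; neither the block size nor the rounding rule is taken from this repository, and real quantizers may differ.

```python
import math

# Assumption: FP8 values are E4M3, whose largest finite magnitude is 448.
FP8_E4M3_MAX = 448.0


def e8m0_scale(amax: float) -> float:
    """Smallest power-of-two scale s such that amax / s fits in E4M3."""
    if amax == 0.0:
        return 1.0  # an all-zero block needs no scaling
    return 2.0 ** math.ceil(math.log2(amax / FP8_E4M3_MAX))


def block_scales(matrix, block=2):
    """Return one power-of-two scale per (block x block) tile of a 2-D list.

    `block=2` is a toy value for illustration only.
    """
    rows, cols = len(matrix), len(matrix[0])
    scales = []
    for r0 in range(0, rows, block):
        row_scales = []
        for c0 in range(0, cols, block):
            # Largest magnitude inside this tile decides its scale.
            amax = max(
                abs(matrix[r][c])
                for r in range(r0, min(r0 + block, rows))
                for c in range(c0, min(c0 + block, cols))
            )
            row_scales.append(e8m0_scale(amax))
        scales.append(row_scales)
    return scales
```

Dividing each tile by its scale brings every value within the FP8 range before rounding, and multiplying by the same scale at inference time recovers the original magnitude. Because each tile gets its own scale, an outlier in one tile does not cost precision in the others.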

Benchmarks

Running our model

vLLM

Supported in vLLM 0.18.0 and later, with DeepGEMM FP8 MoE acceleration.

# pip
pip install "vllm>=0.18.0"

Serving the model with DeepGEMM enabled:

VLLM_USE_DEEP_GEMM=1 vllm serve arcee-ai/Trinity-Mini-FP8-Block \
  --trust-remote-code \
  --max-model-len 4096 \
  --enable-auto-tool-choice \
  --reasoning-parser deepseek_r1 \
  --tool-call-parser hermes

Serving without DeepGEMM (falls back to CUTLASS/Triton):

vllm serve arcee-ai/Trinity-Mini-FP8-Block \
  --trust-remote-code \
  --max-model-len 4096 \
  --enable-auto-tool-choice \
  --reasoning-parser deepseek_r1 \
  --tool-call-parser hermes
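
Once either server is running, a request that applies the recommended sampling settings might look like the sketch below. It uses only the Python standard library and assumes vLLM's default address `http://localhost:8000/v1`. The helper names `build_chat_payload` and `chat` are illustrative, not part of any library. `top_k` and `min_p` are not standard OpenAI parameters; vLLM accepts them as extra sampling parameters in the request body, but other OpenAI-compatible servers may ignore or reject them.

```python
import json
import urllib.request

# Recommended sampling settings from the Model Details section.
RECOMMENDED_SAMPLING = {
    "temperature": 0.15,
    "top_k": 50,
    "top_p": 0.75,
    "min_p": 0.06,
}


def build_chat_payload(prompt, max_tokens=256):
    """Build an OpenAI-style chat request body with the recommended settings."""
    return {
        "model": "arcee-ai/Trinity-Mini-FP8-Block",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        **RECOMMENDED_SAMPLING,
    }


def chat(prompt, base_url="http://localhost:8000/v1"):
    """POST the request to a running server and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a server listening locally, `chat("What is the capital of France?")` returns the assistant's reply as a string. Keep `max_tokens` plus the prompt length within the `--max-model-len` value passed to `vllm serve`.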

Transformers

Use the main transformers branch

git clone https://github.com/huggingface/transformers.git
cd transformers

# pip
pip install '.[torch]'

# uv
uv pip install '.[torch]'

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "arcee-ai/Trinity-Mini-FP8-Block"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

messages = [
    {"role": "user", "content": "Who are you?"},
]

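# Build the prompt with the model's chat template. Asking for a dict explicitly
# keeps this working whether apply_chat_template defaults to a tensor or a dict.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

# Sample with the recommended settings listed under Model Details.
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.15,
    top_k=50,
    top_p=0.75,
    min_p=0.06
)

# The decoded text contains the prompt followed by the generated reply.
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)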

API

Trinity Mini is available today on OpenRouter:

https://openrouter.ai/arcee-ai/trinity-mini

curl -X POST "https://openrouter.ai/api/v1/chat/completions" \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "arcee-ai/trinity-mini",
    "messages": [
      {
        "role": "user",
        "content": "What are some fun things to do in New York?"
      }
    ]
  }'