# Monarch-BERT-MNLI (Hybrid)
The Efficiency Sweet Spot: +12% Speed, <1% Accuracy Loss.
tl;dr: Near-BERT performance on MNLI with extreme resource efficiency. We distilled BERT-Base into a Monarch-Hybrid in just 3 hours on one H100. Despite using only 500k Wiki tokens plus the MNLI training data, the model delivers ~12% higher throughput while staying within 1% of BERT-Base accuracy.
## High Performance, Low Cost
Training models from scratch requires billions of tokens. We took a different path to bend the efficiency curve (a generic sketch of the distillation setup follows the list below):
- Training Time: ~3 hours.
- Hardware: 1x NVIDIA H100.
- Data: Only MNLI (task) + 500k Wikipedia samples (anchor).
- Result: ~99% of BERT-Base's MNLI accuracy retained. High performance, zero waste.
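
The card does not publish the exact training objective, so the following is only a generic sketch of what a task-plus-anchor logit-distillation loss typically looks like; the function name, temperature `T`, and mixing weight `alpha` are illustrative, not the author's values.

```python
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels=None, T=2.0, alpha=0.5):
    """Generic logit distillation: KL against the frozen BERT-Base teacher,
    optionally mixed with the MNLI cross-entropy when labels are available
    (anchor batches, e.g. Wikipedia text, would pass labels=None)."""
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.log_softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
        log_target=True,
    ) * (T * T)
    if labels is None:
        return kd
    task = F.cross_entropy(student_logits, labels)
    return alpha * task + (1.0 - alpha) * kd
```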
## Key Benchmarks
Measured on a single NVIDIA H100 using `torch.compile(mode="max-autotune")`.
| Metric | BERT-Base (Baseline) | Monarch-BERT (This Model) | Delta |
|---|---|---|---|
| Parameters | 110.0M | 81.7M | -25.6% |
| Compute (GFLOPs) | 696.5 | 464.5 | -33.3% |
| Throughput (TPS) | 7261 | 8119 | +11.8% |
| Latency (batch 32) | 4.41 ms | 3.94 ms | -10.7% (faster) |
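
The table reports the author's own measurements. For reference, a throughput/latency figure of this kind can be reproduced roughly with a loop like the one below; the batch size (32), sequence length (128), and iteration counts are assumptions, not the author's exact harness.

```python
import time
import torch
from transformers import AutoModelForSequenceClassification

model_id = "ykae/monarch-bert-base-mnli-hybrid"
batch_size, seq_len, warmup, n_iters = 32, 128, 20, 100  # assumed settings
device = "cuda"

torch.set_float32_matmul_precision("high")
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, trust_remote_code=True
).to(device).eval()
model = torch.compile(model, mode="max-autotune")

# Synthetic batch: random token ids are fine for a pure timing run.
ids = torch.randint(0, 1000, (batch_size, seq_len), device=device)
mask = torch.ones_like(ids)

with torch.no_grad():
    for _ in range(warmup):                       # triggers compilation, warms caches
        model(input_ids=ids, attention_mask=mask)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(n_iters):
        model(input_ids=ids, attention_mask=mask)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

print(f"Latency:    {1000 * elapsed / n_iters:.2f} ms per batch of {batch_size}")
print(f"Throughput: {batch_size * n_iters / elapsed:.0f} sequences/s")
```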
## Usage
This model uses a custom architecture. You must enable `trust_remote_code=True` to load the Monarch layers (`MonarchUp`, `MonarchDown`, `MonarchFFN`).
To see the real speedup, compilation is mandatory (otherwise PyTorch's Python overhead masks the gains).
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from datasets import load_dataset
from torch.utils.data import DataLoader
from tqdm import tqdm

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "ykae/monarch-bert-base-mnli-hybrid"

print(f"Loading hybrid model: {model_id}")
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    trust_remote_code=True  # required: loads the custom Monarch layers
).to(device)

# Optional for accuracy, but required to reproduce the reported speed numbers:
# torch.set_float32_matmul_precision('high')
# model = torch.compile(model, mode="max-autotune")
model.eval()
print("π Loading MNLI Validation set...")
dataset = load_dataset("glue", "mnli", split="validation_matched")
def tokenize_fn(ex):
return tokenizer(ex['premise'], ex['hypothesis'],
padding="max_length", truncation=True, max_length=128)
tokenized_ds = dataset.map(tokenize_fn, batched=True)
tokenized_ds.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
loader = DataLoader(tokenized_ds, batch_size=32)
correct = 0
total = 0
print(f"Starting evaluation on {len(tokenized_ds)} samples...")

with torch.no_grad():
    for batch in tqdm(loader):
        ids = batch['input_ids'].to(device)
        mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)

        outputs = model(ids, attention_mask=mask)
        preds = torch.argmax(outputs.logits, dim=1)

        correct += (preds == labels).sum().item()
        total += labels.size(0)

print("\nHybrid evaluation finished!")
print(f"Accuracy: {100 * correct / total:.2f}%")
```
π§ The "Memory Paradox" (Read this!)
You might notice that while the parameter count is lower (82M vs 110M), the peak VRAM usage during inference can be slightly higher than the baseline.
Why? This is a software artifact, not a hardware limitation. In eager PyTorch, each Monarch layer executes as a chain of separate ops (block-diagonal matmuls plus reshapes/permutations), and every intermediate activation is written back to HBM, so peak activation memory can rise even though the weights shrink.
- Solution: A custom fused Triton kernel (planned) would fuse the block-diagonal matmul and permutation steps of each Monarch layer, keeping intermediate activations in the GPU's SRAM. This would drop dynamic VRAM usage significantly below the baseline, matching the FLOPs reduction.
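
For intuition only, here is a minimal sketch of a Monarch-style structured matmul in plain PyTorch (not the repository's actual `MonarchUp`/`MonarchDown` code; shapes and names are illustrative). Each step materializes a fresh intermediate tensor in HBM, which is exactly what a fused kernel would keep in on-chip SRAM.

```python
import torch

def monarch_matmul(x, w1, w2):
    """Monarch-style product: block-diagonal matmul -> permutation ->
    block-diagonal matmul. x: (batch, n) with n = m * m; w1, w2: (m, m, m),
    i.e. m square blocks of size m x m each."""
    batch, n = x.shape
    m = w1.shape[0]
    assert n == m * m

    x = x.view(batch, m, m)                   # split features into m blocks
    h1 = torch.einsum("koi,bki->bko", w1, x)  # 1st block-diag matmul (intermediate #1)
    h2 = h1.transpose(1, 2).contiguous()      # Monarch permutation   (intermediate #2)
    y = torch.einsum("koi,bki->bko", w2, h2)  # 2nd block-diag matmul (intermediate #3)
    return y.reshape(batch, n)

# Example: a 256 -> 256 projection with 2 * 16 * 16 * 16 = 8192 weights
# instead of 256 * 256 = 65536 for a dense layer.
x = torch.randn(4, 256)
w1 = torch.randn(16, 16, 16)
w2 = torch.randn(16, 16, 16)
y = monarch_matmul(x, w1, w2)  # shape (4, 256)
```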
## Citation
```bibtex
@misc{ykae-monarch-bert-mnli-hybrid-2026,
  author       = {Yusuf Kalyoncuoglu and {YKAE-Vision}},
  title        = {Monarch-BERT-MNLI-Hybrid: Balancing Efficiency and Accuracy via Hybrid Monarch FFNs},
  year         = {2026},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/ykae/monarch-bert-base-mnli-hybrid}}
}
```
## Evaluation results
- Accuracy (GLUE MNLI, self-reported): 83.60%
- Throughput (TPS on H100, self-reported): 8119.8
- Latency (self-reported): 3.94 ms