# Monarch-BERT-MNLI (Hybrid)
The Efficiency Sweet Spot: +12% Speed, <1% Accuracy Loss.
tl;dr: Near-BERT performance on MNLI with extreme resource efficiency. We distilled BERT-Base into a Monarch-Hybrid in just 3 hours on one H100. Despite using only 500k Wiki tokens plus the MNLI training data, the model delivers ~12% higher throughput while staying within 1% of BERT-Base accuracy.
## High Performance, Low Cost
Training models from scratch requires billions of tokens. We took a different path to bend the efficiency curve (a generic sketch of the distillation setup follows the list below):
- Training Time: ~3 hours.
- Hardware: 1x NVIDIA H100.
- Data: Only MNLI (task) + 500k Wikipedia samples (anchor).
- Result: ~99% of BERT-Base's MNLI accuracy retained. High performance, zero waste.
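
The card does not publish the exact training objective, so the following is only a generic sketch of what a task-plus-anchor logit-distillation loss typically looks like; the function name, temperature `T`, and mixing weight `alpha` are illustrative, not the author's values.

```python
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels=None, T=2.0, alpha=0.5):
    """Generic logit distillation: KL against the frozen BERT-Base teacher,
    optionally mixed with the MNLI cross-entropy when labels are available
    (anchor batches, e.g. Wikipedia text, would pass labels=None)."""
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.log_softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
        log_target=True,
    ) * (T * T)
    if labels is None:
        return kd
    task = F.cross_entropy(student_logits, labels)
    return alpha * task + (1.0 - alpha) * kd
```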
## Key Benchmarks
Measured on a single NVIDIA H100 using `torch.compile(mode="max-autotune")`.
| Metric | BERT-Base (Baseline) | Monarch-BERT (This Model) | Delta |
|---|---|---|---|
| Parameters | 110.0M | 81.7M | -25.6% |
| Compute (GFLOPs) | 696.5 | 464.5 | -33.3% |
| Throughput (TPS) | 7261 | 8119 | +11.8% |
| Latency (batch 32) | 4.41 ms | 3.94 ms | -10.7% (faster) |
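
The table reports the author's own measurements. For reference, a throughput/latency figure of this kind can be reproduced roughly with a loop like the one below; the batch size (32), sequence length (128), and iteration counts are assumptions, not the author's exact harness.

```python
import time
import torch
from transformers import AutoModelForSequenceClassification

model_id = "ykae/monarch-bert-base-mnli-hybrid"
batch_size, seq_len, warmup, n_iters = 32, 128, 20, 100  # assumed settings
device = "cuda"

torch.set_float32_matmul_precision("high")
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, trust_remote_code=True
).to(device).eval()
model = torch.compile(model, mode="max-autotune")

# Synthetic batch: random token ids are fine for a pure timing run.
ids = torch.randint(0, 1000, (batch_size, seq_len), device=device)
mask = torch.ones_like(ids)

with torch.no_grad():
    for _ in range(warmup):                       # triggers compilation, warms caches
        model(input_ids=ids, attention_mask=mask)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(n_iters):
        model(input_ids=ids, attention_mask=mask)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

print(f"Latency:    {1000 * elapsed / n_iters:.2f} ms per batch of {batch_size}")
print(f"Throughput: {batch_size * n_iters / elapsed:.0f} sequences/s")
```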
## Usage
This model uses a custom architecture. You must enable `trust_remote_code=True` to load the Monarch layers (`MonarchUp`, `MonarchDown`, `MonarchFFN`).
To see the real speedup, compilation is mandatory (otherwise PyTorch's Python overhead masks the gains).
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from datasets import load_dataset
from torch.utils.data import DataLoader
from tqdm import tqdm

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "ykae/monarch-bert-base-mnli-hybrid"

print(f"Loading hybrid model: {model_id}")
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    trust_remote_code=True  # required: loads the custom Monarch layers
).to(device)

# Optional for accuracy, but required to reproduce the reported speed numbers:
# torch.set_float32_matmul_precision('high')
# model = torch.compile(model, mode="max-autotune")
model.eval()
print("π Loading MNLI Validation set...")
dataset = load_dataset("glue", "mnli", split="validation_matched")
def tokenize_fn(ex):
return tokenizer(ex['premise'], ex['hypothesis'],
padding="max_length", truncation=True, max_length=128)
tokenized_ds = dataset.map(tokenize_fn, batched=True)
tokenized_ds.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
loader = DataLoader(tokenized_ds, batch_size=32)
correct = 0
total = 0
print(f"Starting evaluation on {len(tokenized_ds)} samples...")

with torch.no_grad():
    for batch in tqdm(loader):
        ids = batch['input_ids'].to(device)
        mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)

        outputs = model(ids, attention_mask=mask)
        preds = torch.argmax(outputs.logits, dim=1)

        correct += (preds == labels).sum().item()
        total += labels.size(0)

print("\nHybrid evaluation finished!")
print(f"Accuracy: {100 * correct / total:.2f}%")
```
π§ The "Memory Paradox" (Read this!)
You might notice that while the parameter count is lower (82M vs 110M), the peak VRAM usage during inference can be slightly higher than the baseline.
Why? This is a software artifact, not a hardware limitation. In eager PyTorch, each Monarch layer executes as a chain of separate ops (block-diagonal matmuls plus reshapes/permutations), and every intermediate activation is written back to HBM, so peak activation memory can rise even though the weights shrink.
- Solution: A custom fused Triton kernel (planned) would fuse the block-diagonal matmul and permutation steps of each Monarch layer, keeping intermediate activations in the GPU's SRAM. This would drop dynamic VRAM usage significantly below the baseline, matching the FLOPs reduction.
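
For intuition only, here is a minimal sketch of a Monarch-style structured matmul in plain PyTorch (not the repository's actual `MonarchUp`/`MonarchDown` code; shapes and names are illustrative). Each step materializes a fresh intermediate tensor in HBM, which is exactly what a fused kernel would keep in on-chip SRAM.

```python
import torch

def monarch_matmul(x, w1, w2):
    """Monarch-style product: block-diagonal matmul -> permutation ->
    block-diagonal matmul. x: (batch, n) with n = m * m; w1, w2: (m, m, m),
    i.e. m square blocks of size m x m each."""
    batch, n = x.shape
    m = w1.shape[0]
    assert n == m * m

    x = x.view(batch, m, m)                   # split features into m blocks
    h1 = torch.einsum("koi,bki->bko", w1, x)  # 1st block-diag matmul (intermediate #1)
    h2 = h1.transpose(1, 2).contiguous()      # Monarch permutation   (intermediate #2)
    y = torch.einsum("koi,bki->bko", w2, h2)  # 2nd block-diag matmul (intermediate #3)
    return y.reshape(batch, n)

# Example: a 256 -> 256 projection with 2 * 16 * 16 * 16 = 8192 weights
# instead of 256 * 256 = 65536 for a dense layer.
x = torch.randn(4, 256)
w1 = torch.randn(16, 16, 16)
w2 = torch.randn(16, 16, 16)
y = monarch_matmul(x, w1, w2)  # shape (4, 256)
```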
## Citation
```bibtex
@misc{ykae-monarch-bert-mnli-hybrid-2026,
  author       = {Yusuf Kalyoncuoglu and {YKAE-Vision}},
  title        = {Monarch-BERT-MNLI-Hybrid: Balancing Efficiency and Accuracy via Hybrid Monarch FFNs},
  year         = {2026},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/ykae/monarch-bert-base-mnli-hybrid}}
}
```
## Evaluation results
- Accuracy (GLUE MNLI, self-reported): 83.60%
- Throughput (TPS on H100, self-reported): 8119.8
- Latency (self-reported): 3.94 ms