🦋 Monarch-BERT-MNLI (Hybrid)

The Efficiency Sweet Spot: +12% Speed, <1% Accuracy Loss.

tl;dr: BERT-level performance on MNLI with extreme resource efficiency. We distilled BERT-Base into a Monarch hybrid in just 3 hours on one H100. Despite using only 500k Wikipedia tokens + MNLI data, the model delivers roughly +12% throughput while staying within 1% of BERT's accuracy.

💸 High Performance, Low Cost

Training models from scratch requires billions of tokens. We took a different path to bend the efficiency curve:

  • Training Time: ~3 hours.
  • Hardware: 1x NVIDIA H100.
  • Data: Only MNLI (Task) + 500k Wikipedia Samples (Anchor); a sketch of such a data mix follows this list.
  • Result: 99% of BERT-Base's MNLI accuracy retained. High performance, zero waste.
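
For illustration only, here is a minimal sketch of how a task + anchor mix like this could be assembled with the 🤗 datasets library. The Wikipedia config, the 80/20 sampling ratio, and the single-text-column preprocessing are assumptions for the sketch, not the actual training recipe:

from datasets import load_dataset, interleave_datasets

# Task data: the MNLI training split.
mnli = load_dataset("glue", "mnli", split="train")
# Anchor data: a 500k-sample Wikipedia slice (config name is an assumption).
wiki = load_dataset("wikimedia/wikipedia", "20231101.en", split="train[:500000]")

# Collapse both sources to a single "text" column so they can be interleaved.
mnli = mnli.map(lambda ex: {"text": ex["premise"] + " " + ex["hypothesis"]},
                remove_columns=mnli.column_names)
wiki = wiki.remove_columns([c for c in wiki.column_names if c != "text"])

# Mix task and anchor examples for the distillation loop (ratio is illustrative).
mixed = interleave_datasets([mnli, wiki], probabilities=[0.8, 0.2], seed=0)
print(mixed)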

🚀 Key Benchmarks

Measured on a single NVIDIA H100 using torch.compile(mode="max-autotune").

| Metric | BERT-Base (Baseline) | Monarch-BERT (This) | Delta |
|---|---|---|---|
| Parameters | 110.0M | 81.7M | 📉 -25.6% |
| Compute (GFLOPs) | 696.5 | 464.5 | 📉 -33.3% |
| Throughput (TPS) | 7261 | 8119 | 🚀 +11.8% |
| Latency (batch 32) | 4.41 ms | 3.94 ms | ⚡ Faster |
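
As a reference for reproducing these numbers, below is a minimal latency/throughput harness on synthetic batches (batch 32, sequence length 128 to match the table; warmup and iteration counts are illustrative, and absolute numbers will depend on your hardware and driver/compiler versions):

import time
import torch
from transformers import AutoModelForSequenceClassification

device = "cuda"
model = AutoModelForSequenceClassification.from_pretrained(
    "ykae/monarch-bert-base-mnli-hybrid", trust_remote_code=True
).to(device).eval()
model = torch.compile(model, mode="max-autotune")

batch_size, seq_len = 32, 128
ids = torch.randint(0, 30522, (batch_size, seq_len), device=device)  # BERT WordPiece vocab size
mask = torch.ones_like(ids)

with torch.no_grad():
    for _ in range(10):                          # warmup: triggers compilation/autotuning
        model(ids, attention_mask=mask)
    torch.cuda.synchronize()
    n_iters = 100
    start = time.perf_counter()
    for _ in range(n_iters):
        model(ids, attention_mask=mask)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

print(f"Latency   : {1000 * elapsed / n_iters:.2f} ms / batch")
print(f"Throughput: {n_iters * batch_size / elapsed:.0f} sequences / s")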

🛠️ Usage

This model uses a custom architecture. You must enable trust_remote_code=True to load the Monarch layers (MonarchUp, MonarchDown, MonarchFFN).
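
For intuition, the sketch below shows the block-diagonal structure that Monarch-style layers are built from: a dense dim x dim projection is replaced by two block-diagonal matmuls separated by a transpose permutation. This is a conceptual illustration with made-up names; the actual MonarchUp/MonarchDown/MonarchFFN code loaded via trust_remote_code may differ in block counts, shapes, and bias handling.

import torch
import torch.nn as nn

class MonarchLinearSketch(nn.Module):
    """Illustrative stand-in for a dense nn.Linear(dim, dim):
    two block-diagonal factors with a transpose permutation in between."""
    def __init__(self, dim: int, nblocks: int):
        super().__init__()
        assert dim % nblocks == 0, "dim must split evenly into blocks"
        self.k, self.b = nblocks, dim // nblocks
        # Factor 1: k blocks of shape (b, b). Factor 2: b blocks of shape (k, k).
        self.blk1 = nn.Parameter(torch.randn(self.k, self.b, self.b) * self.b ** -0.5)
        self.blk2 = nn.Parameter(torch.randn(self.b, self.k, self.k) * self.k ** -0.5)

    def forward(self, x):
        lead = x.shape[:-1]
        y = x.reshape(-1, self.k, self.b)                   # group features into k blocks of size b
        y = torch.einsum("nkb,kbc->nkc", y, self.blk1)      # block-diagonal matmul #1
        y = y.transpose(1, 2)                               # transpose ("riffle") permutation
        y = torch.einsum("nbk,bkj->nbj", y, self.blk2)      # block-diagonal matmul #2
        return y.transpose(1, 2).reshape(*lead, -1)

layer = MonarchLinearSketch(dim=768, nblocks=24)            # 768 = BERT-Base hidden size
x = torch.randn(2, 128, 768)
print(layer(x).shape)                                       # torch.Size([2, 128, 768])
print(sum(p.numel() for p in layer.parameters()), "params vs", 768 * 768, "dense")

The parameter and FLOPs savings from this kind of factorization are where the reductions in the table above come from.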

To see the real speedup, compilation is mandatory (otherwise PyTorch's Python-level overhead masks the gains); the relevant lines are included, commented out, in the evaluation script below.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from datasets import load_dataset
from torch.utils.data import DataLoader
from tqdm import tqdm

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "ykae/monarch-bert-base-mnli-hybrid"  

print(f"πŸ“¦ Loading Hybrid Model: {model_id}")
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, 
    trust_remote_code=True
).to(device)

# For the benchmarked speedup, uncomment the three lines below (the first batches will be slow while kernels autotune):
# torch.set_float32_matmul_precision('high')
# print("🔨 Compiling with torch.compile (max-autotune)...")
# model = torch.compile(model, mode="max-autotune")
model.eval()

print("πŸ“Š Loading MNLI Validation set...")
dataset = load_dataset("glue", "mnli", split="validation_matched")

def tokenize_fn(ex):
    return tokenizer(ex['premise'], ex['hypothesis'], 
                     padding="max_length", truncation=True, max_length=128)

tokenized_ds = dataset.map(tokenize_fn, batched=True)
tokenized_ds.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
loader = DataLoader(tokenized_ds, batch_size=32)

correct = 0
total = 0

print(f"πŸš€ Starting evaluation on {len(tokenized_ds)} samples...")
with torch.no_grad():
    for batch in tqdm(loader):
        ids = batch['input_ids'].to(device)
        mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)
        
        outputs = model(ids, attention_mask=mask)
        preds = torch.argmax(outputs.logits, dim=1)
        
        correct += (preds == labels).sum().item()
        total += labels.size(0)

print(f"\nβœ… Hybrid Evaluation Finished!")
print(f"πŸ“ˆ Accuracy: {100 * correct / total:.2f}%")

🧠 The "Memory Paradox" (Read this!)

You might notice that while the parameter count is lower (82M vs 110M), the peak VRAM usage during inference can be slightly higher than the baseline.

Why? This is a software artifact, not a hardware limitation: the Monarch projections currently run as a chain of separate (unfused) PyTorch ops, so their intermediate activations are materialized in GPU memory between steps.

  • Solution: A custom fused Triton kernel (planned) would fuse those steps, keeping the intermediate activations in the GPU's SRAM. This would drop dynamic VRAM usage significantly below the baseline, matching the FLOPs reduction.
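
To check the memory behaviour on your own hardware, a quick way is PyTorch's allocator statistics. The sketch below compares peak inference VRAM of this checkpoint against bert-base-uncased as a stand-in baseline (batch size, sequence length, and the baseline checkpoint are assumptions for the sketch):

import torch
from transformers import AutoModelForSequenceClassification

def peak_inference_vram_mb(model_id, batch_size=32, seq_len=128, **kwargs):
    model = AutoModelForSequenceClassification.from_pretrained(model_id, **kwargs).cuda().eval()
    ids = torch.randint(0, 30522, (batch_size, seq_len), device="cuda")
    torch.cuda.reset_peak_memory_stats()          # peak now tracks weights + activations from here on
    with torch.no_grad():
        model(ids)
    return torch.cuda.max_memory_allocated() / 2**20

print("Monarch :", peak_inference_vram_mb("ykae/monarch-bert-base-mnli-hybrid", trust_remote_code=True))
print("Baseline:", peak_inference_vram_mb("bert-base-uncased"))  # untrained head; only the memory footprint matters here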

Citation

@misc{ykae-monarch-bert-mnli-hybrid-2026,
  author = {Yusuf Kalyoncuoglu and {YKAE-Vision}},
  title = {Monarch-BERT-MNLI-Hybrid: Balancing Efficiency and Accuracy via Hybrid Monarch FFNs},
  year = {2026},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/ykae/monarch-bert-base-mnli-hybrid}}
}