Darwin-27B-Opus: Surpassing the Foundation Model Without Training — 86.9% on GPQA Diamond, Ranked 5th Globally
Qwen3.5-27B Dense | 27B Params | Thinking Mode | 262K Context | 201 Languages | BF16 | Apache 2.0
Zero training. Zero data. Single GPU. 2 hours. Surpassed the foundation model.
Abstract
Darwin-27B-Opus is a 27-billion-parameter language model produced entirely through evolutionary crossbreeding of pretrained models, requiring zero additional training, zero data, and a single GPU. On the GPQA Diamond benchmark — a graduate-level scientific reasoning evaluation comprising 198 expert-crafted questions in physics, chemistry, and biology — Darwin-27B-Opus achieves 86.9%, surpassing its progenitor Qwen3.5-27B (85.5%) by +1.4 percentage points and securing 5th place on the HuggingFace GPQA leaderboard.
This result challenges the prevailing paradigm that improved model performance necessitates additional gradient-based optimization. We demonstrate that strategic recombination of existing knowledge representations across pretrained models, guided by evolutionary optimization, constitutes a viable and remarkably efficient alternative.
GPQA Diamond Leaderboard (April 12, 2026)
| Rank | Model | Parameters | GPQA Diamond |
|---|---|---|---|
| 1 | TNSA/NGen-4-Pro | — | 91.1% |
| 2 | TNSA/NGen-4 | — | 90.1% |
| 3 | Qwen/Qwen3.5-397B-A17B | 397B | 88.4% |
| 4 | moonshotai/Kimi-K2.5 | — | 87.6% |
| 5 | FINAL-Bench/Darwin-27B-Opus | 27B | 86.9% |
| 6 | Qwen/Qwen3.5-122B-A10B | 122B | 86.6% |
| 7 | zai-org/GLM-5.1 | 744B | 86.2% |
| 8 | zai-org/GLM-5 | 744B | 86.0% |
| 9 | zai-org/GLM-4.7 | — | 85.7% |
| 10 | Qwen/Qwen3.5-27B | 27B | 85.5% |
A 27B model — produced without any training — surpasses GLM-5.1 (744B), Qwen3.5-122B (122B), and its own progenitor Qwen3.5-27B. This represents a parameter efficiency ratio exceeding 27× relative to GLM-5.1.
What Is Darwin?
Darwin is an evolutionary model breeding engine that crossbreeds the FFN (Feed-Forward Network) knowledge layers of pretrained AI models to automatically produce offspring that surpass both parents — with zero additional training.
Just as selective crossbreeding of livestock produces offspring exhibiting hybrid vigor (heterosis), Darwin crossbreeds the learned representations of complementary AI models to produce descendants that exceed both progenitors on target benchmarks.
Core Principle: FFN = Knowledge, Attention = Reasoning
Modern transformer-based language models consist of two principal computational modules:
- Attention — orchestrates information routing and constructs reasoning chains. The model's inferential architecture.
- FFN — stores factual knowledge and encodes learned patterns. The model's knowledge repository.
Darwin exploits this decomposition:
- FFN layers are transplantable between compatible models, enabling knowledge transfer without disrupting reasoning.
- Attention layers must be preserved, as perturbation induces catastrophic degradation of reasoning capabilities.
This principle is supported by recent theoretical work (arXiv:2501.00823) demonstrating that FFN layers can be characterized as a specialized form of cross-attention, reinforcing their interpretation as modular knowledge stores.
Parent Models
| Role | Model | Contribution |
|---|---|---|
| Father (Structure) | Qwen/Qwen3.5-27B | Foundation architecture, native reasoning, 201-language support |
| Mother (Knowledge) | Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled | Claude 4.6 Opus structured reasoning patterns via SFT distillation |
Both parents share identical architecture: hidden_size=4096, intermediate_size=17408, 64 layers — ensuring 100% structural compatibility for FFN crossbreeding.
Model MRI Diagnostic Scan
Left: Father (Qwen3.5-27B) — Broad, balanced activation across reasoning and knowledge domains. Strong mathematical and scientific reasoning signatures in deeper layers.
Right: Mother (Claude-4.6-Opus-Reasoning-Distilled) — Intensified reasoning concentration from Claude distillation. Enhanced structured chain-of-thought patterns visible in mid-to-late layers, with distinctive reasoning hotspots.
Evolution Process
Model MRI Scan — Darwin V6 performs comprehensive diagnostic analysis of both parents, profiling each layer's functional specialization across cognitive domains (reasoning, knowledge, language, mathematics).
CMA-ES Evolutionary Search — Covariance Matrix Adaptation Evolution Strategy optimizes per-block crossbreeding ratios across all 64 layers. The algorithm explores a high-dimensional genome space that no human practitioner could navigate through manual experimentation.
Health Check — Automated post-merge validation ensures the offspring model functions correctly.
Total compute: H100 × 1, approximately 2 hours.
Parent Layer-wise Comparison
This visualization illustrates the per-layer divergence between father and mother models. Regions of high divergence represent layers where CMA-ES must make critical allocation decisions — balancing the father's reasoning architecture against the mother's distilled knowledge patterns.
GPQA Diamond Evaluation
Methodology
We employed a two-pass evaluation protocol:
Pass 1 — Greedy Baseline
- All 198 questions, deterministic decoding (do_sample=False)
- Epoch AI standard prompt format
- Result: 148/198 = 74.7%
Pass 2 — Selective Retry with Verification
- 50 incorrectly answered questions only
- 8 independent stochastic generations per question (maj@8, temperature=0.7)
- Contested results (vote margin ≤ 1) trigger a verification round: top-2 candidates are presented for comparative analysis via greedy decoding
- Result: 24 additional corrections
Results by Shard
| Shard | Greedy | After Retry | Flipped | Gain |
|---|---|---|---|---|
| Shard 0 | 48/66 (72.7%) | 58/66 (87.9%) | 10/18 | +15.2%p |
| Shard 1 | 49/66 (74.2%) | 57/66 (86.4%) | 8/17 | +12.1%p |
| Shard 2 | 51/66 (77.3%) | 57/66 (86.4%) | 6/15 | +9.1%p |
| Total | 148/198 (74.7%) | 172/198 (86.9%) | 24/50 | +12.1%p |
Verification Round Efficacy
Of 19 questions triggering verification (margin ≤ 1 vote), 12 were successfully corrected (63.2% success rate). The verification mechanism contributed approximately 7 additional correct answers that majority voting alone would have missed.
Hybrid Vigor: CLIcK Korean Benchmark
To validate hybrid vigor across languages, we evaluated a second-generation offspring — Darwin-27B-KR — bred from Darwin-27B-Opus (father) and a Korean-specialized model (mother).
Four-Generation Comparison (200 questions, 0-shot)
| Generation | Model | CLIcK Overall |
|---|---|---|
| Gen 0 (Ancestor) | Qwen3.5-27B | 69.52% |
| Gen 1 (Father) | Darwin-27B-Opus | 70.19% |
| — (Mother) | Korean-specialized SFT | 74.74% |
| Gen 2 (Child) | Darwin-27B-KR | 75.59% ★ |
The child surpasses both parents — winning 7 out of 11 evaluation categories. Largest gains: Law (+9.5pp), Functional Language (+7.6pp), History (+6.5pp).
Two generations of zero-training evolution achieved +6.07 percentage points over the original Qwen3.5-27B foundation model.
Computational Economics
| Darwin-27B-Opus | Conventional Fine-Tuning | |
|---|---|---|
| GPU | H100 × 1 | H100 × 8–64 |
| Time | ~2 hours | Days to weeks |
| Training tokens | 0 | 10⁶–10⁹ |
| Gradient computation | None | Full backpropagation |
| Output model size | Identical to parent | Identical to parent |
| Inference overhead | Zero | Zero |
The resultant model is architecturally indistinguishable from its progenitor — identical parameter count, identical inference speed, identical deployment requirements.
Model Specifications
| Architecture | Qwen3.5 Dense (GatedDeltaNet) |
| Parameters | 27B |
| Hidden Size | 4096 |
| Intermediate Size | 17408 |
| Layers | 64 |
| Context Length | 262,144 (extensible to 1M via YaRN) |
| Precision | BF16 |
| Languages | 201 |
| Thinking Mode | Enabled |
| License | Apache 2.0 |
Usage
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
tokenizer = AutoTokenizer.from_pretrained(
"FINAL-Bench/Darwin-27B-Opus", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
"FINAL-Bench/Darwin-27B-Opus",
torch_dtype=torch.bfloat16,
device_map="auto",
trust_remote_code=True,
)
messages = [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]
text = tokenizer.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
VRAM Requirements
| Setup | VRAM | Status |
|---|---|---|
| BF16 Full Precision | ~55 GB | H100 single GPU |
| NVIDIA H100 80GB | 80 GB | Very comfortable |
| 2× RTX 4090 48GB | 48 GB | Tensor parallel |
| 4-bit Quantized | ~16 GB | RTX 4090 single GPU |
Darwin Model Family
| Model | Gen | Parameters | GPQA Diamond | CLIcK | Specialty |
|---|---|---|---|---|---|
| Darwin-27B-Opus | Gen 1 | 27B | 86.9% ★ | 70.19% | Claude reasoning |
| Darwin-27B-KR | Gen 2 | 27B | — | 75.59% ★ | Korean hybrid vigor |
| Darwin-4B-Genesis | Gen 3 | 4B | ~60% | 92% | Cross-architecture breeding |
| Darwin-31B-Opus | Gen 1 | 31B | 66% | — | Gemma4 reasoning |
| Darwin-35B-A3B-Opus | Gen 1 | 35B MoE | 90% | — | MoE reasoning |
| Darwin-9B-Opus | Gen 1 | 9B | — | — | Edge deployment |
Key Findings
FFN = Knowledge, Attention = Reasoning — Empirically validated through ablation: attention blending causes GPQA collapse (60% → 10%), while FFN crossbreeding consistently enhances performance.
Hybrid vigor scales with model size — Confirmed at 4B (Genesis, CLIcK 92%) and 27B (KR, CLIcK 75.59%).
Zero-training evolution is recursive — Gen 0 → Gen 1 → Gen 2, each generation improving without gradient updates.
CMA-ES discovers what humans cannot — Manual 50:50 blending degrades performance; evolutionary search finds non-obvious optimal ratios.
Verification rounds recover contested answers — 63.2% success rate on close-vote questions, contributing ~7 additional correct answers.
Roadmap
- K-AI Leaderboard official submission (Korean government-certified evaluation)
- MMLU-Pro, AIME 2025 evaluation
- Cross-architecture breeding at 27B scale (Transformer × Mamba FFN)
- Third-generation recursive evolution
- Darwin engine research paper
References
- DARE-TIES: Yadav et al., 2023 (arXiv:2311.03099) — re-implemented without library dependency
- FFN as Cross-Attention: arXiv:2501.00823
- CLIcK: Kim et al., 2024 (arXiv:2403.06412)
- GPQA: Rein et al., 2023 (arXiv:2311.12022)
- CMA-ES: Hansen & Ostermeier, 2001
- Darwin V6 Engine: HuggingFace Space
Built By
| Developer | VIDRAFT |
| Engine | Darwin V6 (Diagnostic-Guided Evolutionary Merge) |
| Architecture | Qwen3.5-27B Dense |
| License | Apache 2.0 |
Citation
@misc{vidraft_darwin_27b_opus_2026,
title = {Darwin-27B-Opus: Surpassing the Foundation Model Without Training},
subtitle = {86.9\% on GPQA Diamond via Evolutionary FFN Crossbreeding},
author = {VIDRAFT},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/FINAL-Bench/Darwin-27B-Opus}}
}
- Downloads last month
- -
Model tree for FINAL-Bench/Darwin-27B-Opus
Base model
Qwen/Qwen3.5-27BCollection including FINAL-Bench/Darwin-27B-Opus
Papers for FINAL-Bench/Darwin-27B-Opus
Decoupling Knowledge and Reasoning in Transformers: A Modular Architecture with Generalized Cross-Attention
CLIcK: A Benchmark Dataset of Cultural and Linguistic Intelligence in Korean
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch
Evaluation results
- Diamond on Idavidrein/gpqa View evaluation results leaderboard 86.9