You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Log in or Sign Up to review the conditions and access this model content.

Darwin-27B-Opus: Surpassing the Foundation Model Without Training — 86.9% on GPQA Diamond, Ranked 5th Globally

Darwin-27B-Opus Overview

GPQA CLIcK

Genesis 9B 31B 35B

Family FINAL Bench

Qwen3.5-27B Dense | 27B Params | Thinking Mode | 262K Context | 201 Languages | BF16 | Apache 2.0
Zero training. Zero data. Single GPU. 2 hours. Surpassed the foundation model.


Abstract

Darwin-27B-Opus is a 27-billion-parameter language model produced entirely through evolutionary crossbreeding of pretrained models, requiring zero additional training, zero data, and a single GPU. On the GPQA Diamond benchmark — a graduate-level scientific reasoning evaluation comprising 198 expert-crafted questions in physics, chemistry, and biology — Darwin-27B-Opus achieves 86.9%, surpassing its progenitor Qwen3.5-27B (85.5%) by +1.4 percentage points and securing 5th place on the HuggingFace GPQA leaderboard.

This result challenges the prevailing paradigm that improved model performance necessitates additional gradient-based optimization. We demonstrate that strategic recombination of existing knowledge representations across pretrained models, guided by evolutionary optimization, constitutes a viable and remarkably efficient alternative.


GPQA Diamond Leaderboard (April 12, 2026)

Rank Model Parameters GPQA Diamond
1 TNSA/NGen-4-Pro 91.1%
2 TNSA/NGen-4 90.1%
3 Qwen/Qwen3.5-397B-A17B 397B 88.4%
4 moonshotai/Kimi-K2.5 87.6%
5 FINAL-Bench/Darwin-27B-Opus 27B 86.9%
6 Qwen/Qwen3.5-122B-A10B 122B 86.6%
7 zai-org/GLM-5.1 744B 86.2%
8 zai-org/GLM-5 744B 86.0%
9 zai-org/GLM-4.7 85.7%
10 Qwen/Qwen3.5-27B 27B 85.5%

A 27B model — produced without any training — surpasses GLM-5.1 (744B), Qwen3.5-122B (122B), and its own progenitor Qwen3.5-27B. This represents a parameter efficiency ratio exceeding 27× relative to GLM-5.1.


What Is Darwin?

Darwin is an evolutionary model breeding engine that crossbreeds the FFN (Feed-Forward Network) knowledge layers of pretrained AI models to automatically produce offspring that surpass both parents — with zero additional training.

Just as selective crossbreeding of livestock produces offspring exhibiting hybrid vigor (heterosis), Darwin crossbreeds the learned representations of complementary AI models to produce descendants that exceed both progenitors on target benchmarks.

Core Principle: FFN = Knowledge, Attention = Reasoning

Modern transformer-based language models consist of two principal computational modules:

  • Attention — orchestrates information routing and constructs reasoning chains. The model's inferential architecture.
  • FFN — stores factual knowledge and encodes learned patterns. The model's knowledge repository.

Darwin exploits this decomposition:

  • FFN layers are transplantable between compatible models, enabling knowledge transfer without disrupting reasoning.
  • Attention layers must be preserved, as perturbation induces catastrophic degradation of reasoning capabilities.

This principle is supported by recent theoretical work (arXiv:2501.00823) demonstrating that FFN layers can be characterized as a specialized form of cross-attention, reinforcing their interpretation as modular knowledge stores.


Parent Models

Role Model Contribution
Father (Structure) Qwen/Qwen3.5-27B Foundation architecture, native reasoning, 201-language support
Mother (Knowledge) Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled Claude 4.6 Opus structured reasoning patterns via SFT distillation

Both parents share identical architecture: hidden_size=4096, intermediate_size=17408, 64 layers — ensuring 100% structural compatibility for FFN crossbreeding.

Model MRI Diagnostic Scan

Father (Qwen3.5-27B) MRI Scan Mother (Claude-4.6-Opus-Reasoning-Distilled) MRI Scan

Left: Father (Qwen3.5-27B) — Broad, balanced activation across reasoning and knowledge domains. Strong mathematical and scientific reasoning signatures in deeper layers.
Right: Mother (Claude-4.6-Opus-Reasoning-Distilled) — Intensified reasoning concentration from Claude distillation. Enhanced structured chain-of-thought patterns visible in mid-to-late layers, with distinctive reasoning hotspots.


Evolution Process

  1. Model MRI Scan — Darwin V6 performs comprehensive diagnostic analysis of both parents, profiling each layer's functional specialization across cognitive domains (reasoning, knowledge, language, mathematics).

  2. CMA-ES Evolutionary Search — Covariance Matrix Adaptation Evolution Strategy optimizes per-block crossbreeding ratios across all 64 layers. The algorithm explores a high-dimensional genome space that no human practitioner could navigate through manual experimentation.

  3. Health Check — Automated post-merge validation ensures the offspring model functions correctly.

Total compute: H100 × 1, approximately 2 hours.

Parent Layer-wise Comparison

Father vs Mother layer-wise importance comparison

This visualization illustrates the per-layer divergence between father and mother models. Regions of high divergence represent layers where CMA-ES must make critical allocation decisions — balancing the father's reasoning architecture against the mother's distilled knowledge patterns.


GPQA Diamond Evaluation

Methodology

We employed a two-pass evaluation protocol:

Pass 1 — Greedy Baseline

  • All 198 questions, deterministic decoding (do_sample=False)
  • Epoch AI standard prompt format
  • Result: 148/198 = 74.7%

Pass 2 — Selective Retry with Verification

  • 50 incorrectly answered questions only
  • 8 independent stochastic generations per question (maj@8, temperature=0.7)
  • Contested results (vote margin ≤ 1) trigger a verification round: top-2 candidates are presented for comparative analysis via greedy decoding
  • Result: 24 additional corrections

Results by Shard

Shard Greedy After Retry Flipped Gain
Shard 0 48/66 (72.7%) 58/66 (87.9%) 10/18 +15.2%p
Shard 1 49/66 (74.2%) 57/66 (86.4%) 8/17 +12.1%p
Shard 2 51/66 (77.3%) 57/66 (86.4%) 6/15 +9.1%p
Total 148/198 (74.7%) 172/198 (86.9%) 24/50 +12.1%p

Verification Round Efficacy

Of 19 questions triggering verification (margin ≤ 1 vote), 12 were successfully corrected (63.2% success rate). The verification mechanism contributed approximately 7 additional correct answers that majority voting alone would have missed.


Hybrid Vigor: CLIcK Korean Benchmark

To validate hybrid vigor across languages, we evaluated a second-generation offspring — Darwin-27B-KR — bred from Darwin-27B-Opus (father) and a Korean-specialized model (mother).

Four-Generation Comparison (200 questions, 0-shot)

Generation Model CLIcK Overall
Gen 0 (Ancestor) Qwen3.5-27B 69.52%
Gen 1 (Father) Darwin-27B-Opus 70.19%
— (Mother) Korean-specialized SFT 74.74%
Gen 2 (Child) Darwin-27B-KR 75.59%

The child surpasses both parents — winning 7 out of 11 evaluation categories. Largest gains: Law (+9.5pp), Functional Language (+7.6pp), History (+6.5pp).

Two generations of zero-training evolution achieved +6.07 percentage points over the original Qwen3.5-27B foundation model.


Computational Economics

Darwin-27B-Opus Conventional Fine-Tuning
GPU H100 × 1 H100 × 8–64
Time ~2 hours Days to weeks
Training tokens 0 10⁶–10⁹
Gradient computation None Full backpropagation
Output model size Identical to parent Identical to parent
Inference overhead Zero Zero

The resultant model is architecturally indistinguishable from its progenitor — identical parameter count, identical inference speed, identical deployment requirements.


Model Specifications

Architecture Qwen3.5 Dense (GatedDeltaNet)
Parameters 27B
Hidden Size 4096
Intermediate Size 17408
Layers 64
Context Length 262,144 (extensible to 1M via YaRN)
Precision BF16
Languages 201
Thinking Mode Enabled
License Apache 2.0

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained(
    "FINAL-Bench/Darwin-27B-Opus", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "FINAL-Bench/Darwin-27B-Opus",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

VRAM Requirements

Setup VRAM Status
BF16 Full Precision ~55 GB H100 single GPU
NVIDIA H100 80GB 80 GB Very comfortable
2× RTX 4090 48GB 48 GB Tensor parallel
4-bit Quantized ~16 GB RTX 4090 single GPU

Darwin Model Family

Model Gen Parameters GPQA Diamond CLIcK Specialty
Darwin-27B-Opus Gen 1 27B 86.9% 70.19% Claude reasoning
Darwin-27B-KR Gen 2 27B 75.59% Korean hybrid vigor
Darwin-4B-Genesis Gen 3 4B ~60% 92% Cross-architecture breeding
Darwin-31B-Opus Gen 1 31B 66% Gemma4 reasoning
Darwin-35B-A3B-Opus Gen 1 35B MoE 90% MoE reasoning
Darwin-9B-Opus Gen 1 9B Edge deployment

Key Findings

  1. FFN = Knowledge, Attention = Reasoning — Empirically validated through ablation: attention blending causes GPQA collapse (60% → 10%), while FFN crossbreeding consistently enhances performance.

  2. Hybrid vigor scales with model size — Confirmed at 4B (Genesis, CLIcK 92%) and 27B (KR, CLIcK 75.59%).

  3. Zero-training evolution is recursive — Gen 0 → Gen 1 → Gen 2, each generation improving without gradient updates.

  4. CMA-ES discovers what humans cannot — Manual 50:50 blending degrades performance; evolutionary search finds non-obvious optimal ratios.

  5. Verification rounds recover contested answers — 63.2% success rate on close-vote questions, contributing ~7 additional correct answers.


Roadmap

  • K-AI Leaderboard official submission (Korean government-certified evaluation)
  • MMLU-Pro, AIME 2025 evaluation
  • Cross-architecture breeding at 27B scale (Transformer × Mamba FFN)
  • Third-generation recursive evolution
  • Darwin engine research paper

References


Built By

Developer VIDRAFT
Engine Darwin V6 (Diagnostic-Guided Evolutionary Merge)
Architecture Qwen3.5-27B Dense
License Apache 2.0

Citation

@misc{vidraft_darwin_27b_opus_2026,
  title        = {Darwin-27B-Opus: Surpassing the Foundation Model Without Training},
  subtitle     = {86.9\% on GPQA Diamond via Evolutionary FFN Crossbreeding},
  author       = {VIDRAFT},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/FINAL-Bench/Darwin-27B-Opus}}
}
Downloads last month
-
Safetensors
Model size
28B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for FINAL-Bench/Darwin-27B-Opus

Base model

Qwen/Qwen3.5-27B
Finetuned
(32)
this model
Finetunes
3 models

Collection including FINAL-Bench/Darwin-27B-Opus

Papers for FINAL-Bench/Darwin-27B-Opus

Evaluation results