We're releasing Darwin-4B-David, the first second-generation model in the Darwin Opus family. By evolving an already-evolved model, it achieves 85.0% on GPQA Diamond, surpassing both its original ancestor (58.6%) and gemma-4-31B (84.3%), with just 4.5B parameters.
Second-Generation Evolution
Most merges start from a base model and produce a single offspring. Darwin-4B-David breaks this pattern. The Father (Darwin-4B-Opus) is a Gen-1 model, already evolved from gemma-4-E4B-it with Claude Opus reasoning distillation. The Mother (DavidAU's DECKARD-Expresso-Universe) brings Unsloth deep tuning across 5 in-house datasets, with thinking mode on by default. Crossbreeding the two produced the first Gen-2 Darwin model.
Darwin V6's Model MRI scanned both parents across all 42 layers, assigning an independent optimal ratio to each layer. The Mother's creativity and Korean-language hotspot (layers 22–25, weight 0.95) was maximally absorbed, while the Father's reasoning core (layers 30–40, weight 0.48) was preserved. This is "Merge = Evolve" applied recursively: evolution of evolution.
Benchmarks
Darwin-4B-David scores 85.0% on GPQA Diamond, +26.4 percentage points over the original ancestor's 58.6%. The evaluation was generative with maj@8 (8 generations per question, majority vote), the Epoch AI prompt format, thinking mode enabled, and 50 sampled questions. On ARC-Challenge (25-shot, loglikelihood), Darwin-4B-David and the original model both score 64.93%. This is expected, since loglikelihood scoring doesn't capture differences in thinking-mode reasoning.
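To make the maj@8 protocol concrete, here is a minimal sketch of majority-vote scoring. It is not the evaluation harness used for these numbers: the function names, the toy answers, and the tie-breaking rule (first answer seen wins) are all assumptions made for illustration.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first (assumption)."""
    counts = Counter(answers)
    best = max(counts.values())
    for answer in answers:
        if counts[answer] == best:
            return answer


def maj_at_k_accuracy(samples_per_question, gold):
    """Fraction of questions whose majority-voted answer matches the gold label."""
    correct = sum(
        majority_vote(samples) == label
        for samples, label in zip(samples_per_question, gold)
    )
    return correct / len(gold)


# Toy data (hypothetical): 2 questions, 8 sampled answers each.
samples = [
    ["A", "A", "B", "A", "C", "A", "A", "D"],  # majority "A"
    ["B", "C", "C", "B", "C", "D", "C", "C"],  # majority "C"
]
gold = ["A", "B"]
accuracy = maj_at_k_accuracy(samples, gold)
```

With a 50-question sample, each question moves the score by 2 percentage points, so small gaps between models should be read at that granularity.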
Why This Matters
gemma-4-31B (30.7B parameters) scores 84.3%. Darwin-4B-David surpasses it at roughly 1/7th the size, with no training and no RL: just 45 minutes of MRI-guided DARE-TIES merging on one H100. The name "David" honors the Mother's creator, DavidAU, and evokes David vs. Goliath.
🧬 Darwin V6: Diagnostic-Guided Evolutionary Model Merging
We are releasing Darwin-31B-Opus, a reasoning-enhanced model that merges Google's Gemma-4-31B-it with TeichAI's Claude Opus Distill using the Darwin V6 engine.
Conventional merging tools (mergekit, etc.) apply a single ratio to all tensors. Set ratio=0.5 and all 1,188 tensors blend identically, with no distinction between which tensors matter for reasoning versus coding.
Darwin V6 diagnoses both parents at the tensor level before merging. It measures Shannon entropy, standard deviation, and L2 norm for every tensor, then passes 5 diagnostic probes (REASONING, CODE, MATH, KNOWLEDGE, LANGUAGE) through the model to determine layer-wise functional importance. Each of the 1,188 tensors receives an independent optimal ratio.
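As a rough illustration of the static part of this diagnosis, the sketch below computes the three named statistics for one tensor, flattened to a list of floats. How Darwin V6 actually estimates entropy is not described here; the histogram-based estimate, the bin count, and the function name `tensor_stats` are assumptions.

```python
import math
from collections import Counter


def tensor_stats(values, bins=16):
    """Histogram Shannon entropy (bits), standard deviation, and L2 norm."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    l2_norm = math.sqrt(sum(v * v for v in values))

    # Bucket the values into equal-width bins (assumed entropy estimator).
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for constant tensors
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())

    return {"entropy": entropy, "std": std, "l2_norm": l2_norm}


stats = tensor_stats([3.0, 4.0])
```

In a real pipeline these statistics would be computed per tensor for both parents and then compared, alongside the probe-based measurements.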
combined = static(entropy/std/norm) × 0.4 + probe(cosine_distance) × 0.6
final_ratio = mri_ratio × mri_trust + genome_ratio × (1 - mri_trust)
When one parent is overwhelmingly superior for a tensor (ratio < 0.15 or > 0.85), Darwin transplants it directly without interpolation. The mri_trust parameter itself is optimized by CMA-ES evolutionary search, so optimal transplant intensity is determined automatically. After merging, a Health Check compares the child against both parents layer-by-layer to detect interference or function loss.
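The ratio formulas and the transplant rule can be sketched as follows. This is an illustration under stated assumptions, not the Darwin V6 implementation: the function names are hypothetical, `ratio` is assumed to be the share taken from the second parent, and plain linear interpolation stands in for whatever merge operator the engine applies between the thresholds.

```python
def combined_score(static_score, probe_score):
    """Blend static and probe signals with the 0.4 / 0.6 weights from the text."""
    return static_score * 0.4 + probe_score * 0.6


def final_ratio(mri_ratio, genome_ratio, mri_trust):
    """Trust-weighted mix of the MRI-derived ratio and the evolved genome ratio."""
    return mri_ratio * mri_trust + genome_ratio * (1 - mri_trust)


def merge_tensor(parent_a, parent_b, ratio, low=0.15, high=0.85):
    """Merge two flattened tensors; ratio is parent_b's share (assumption)."""
    if ratio < low:
        return list(parent_a)  # parent_a dominates: transplant it unchanged
    if ratio > high:
        return list(parent_b)  # parent_b dominates: transplant it unchanged
    # Otherwise interpolate (linear interpolation is an assumption here).
    return [(1 - ratio) * a + ratio * b for a, b in zip(parent_a, parent_b)]


ratio = final_ratio(mri_ratio=0.9, genome_ratio=0.5, mri_trust=0.5)
blended = merge_tensor([1.0, 2.0], [3.0, 4.0], ratio)
transplanted = merge_tensor([1.0, 2.0], [3.0, 4.0], 0.9)
```

Because `mri_trust` scales how far the final ratio is pulled toward the MRI recommendation, a search procedure such as CMA-ES can tune it to control how often tensors cross the transplant thresholds.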
🧬 Darwin-35B-A3B-Opus: The Child That Surpassed Both Parents
What if a merged model could beat both its parents? We proved it can. Darwin-35B-A3B-Opus is a 35B MoE model (3B active) built with our Darwin V5 engine, the first evolution system that CT-scans parent models before merging them.
🤗 Model: FINAL-Bench/Darwin-35B-A3B-Opus
The result speaks for itself: GPQA Diamond 90.0%, versus 84.2% for the Father (Qwen3.5-35B-A3B) and 85.0% for the Mother (Claude 4.6 Opus Distilled). That is a relative gain of +6.9% over the Father and +5.9% over the Mother (5.8 and 5.0 percentage points, respectively). Not a tradeoff, but a genuine leap. Meanwhile, MMMLU sits at 85.0% (Father: 85.2%), multimodal capability is fully intact, and all 201 languages are preserved.
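The quoted GPQA Diamond gains match relative improvements rather than percentage-point differences. A short check of the arithmetic, using only the scores stated above (the helper name `gains` is hypothetical):

```python
def gains(child, parent):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = child - parent
    relative = (child / parent - 1) * 100
    return round(absolute, 1), round(relative, 1)


over_father = gains(90.0, 84.2)  # (5.8, 6.9)
over_mother = gains(90.0, 85.0)  # (5.0, 5.9)
```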
How? Model MRI changed everything. Traditional merging is guesswork. Darwin V4 added evolution. Darwin V5 added X-ray vision. Model MRI scans each parent layer by layer and discovers three things: the Mother's L34–L38 is the reasoning engine (peak cosine distance), 50–65% of the Mother's experts are dead (killed by text-only distillation), and the Father is a healthy generalist with every expert alive. The prescription: transplant the Mother's reasoning brain at L38 (90% weight), replace her dead experts with the Father's living ones, and let the Father's router handle the output layer. Reasoning went up. Versatility stayed intact. No tradeoff, just evolution.
35B total, 3B active (MoE) · GPQA Diamond 90.0% · MMMLU 85.0% (201 languages) · Multimodal Image & Video · 262K native context · 147.8 tok/s on H100 · Runs on a single RTX 4090 (Q4) · Apache 2.0
Darwin V5's full algorithm and technical details will be released alongside an upcoming paper.