SeaWolf-AI (VIDRAFT

reacted to their post with 😎 about 17 hours ago

Post

2173

Darwin V9 — GPQA Diamond 90.9%, #1 on the leaderboard, with pure greedy decoding
Darwin-398B-JGOS reaches 90.9% (180/198) on GPQA Diamond, the PhD-level scientific reasoning benchmark, ranking #1 on the Hugging Face GPQA Diamond leaderboard. No self-consistency, no test-time compute scaling — this was achieved with a single greedy decode (temperature 0, single sample, max 16,384 tokens). The full eval config is published in the model card, so anyone can reproduce it. Raw reasoning, no score inflation.
The result comes from Darwin V9, a patented evolutionary model-development platform. Its core idea: it never trains a model from scratch.
Why Darwin V9 beats training from scratch

Cost & speed: no trillion-token pretraining run, no months of compute — a purpose-built, high-performance model is produced in a fraction of the time.
Reuse of proven intelligence: instead of re-learning every capability from a blank slate, it selects and combines only the strengths of already-trained, already-validated models, so results are stable and predictable.
Surgical transplantation: it identifies which neural region of which model holds which capability — at the FFN (Feed Forward Network) layer level — and grafts in only the segments that contribute to the target skill.

How it works: a large model (Qwen 3.5 397B) serves as the mother model (the substrate); several father models specialized in reasoning, coding, and language are analyzed layer-by-layer across their FFN regions; the segments that contribute to the target performance are extracted and transplanted into the mother model to produce a new child model. The result is a ~400B MoE that activates only ~17B parameters per token at inference — large-model capacity with efficient inference.
If training from scratch means rebuilding everything from a blank page, Darwin V9 means precisely recombining intelligence that has already been proven. GPQA Diamond #1 is the proof.
Model: FINAL-Bench/Darwin-398B-JGOS

posted an update about 17 hours ago

Post

2173

Darwin V9 — GPQA Diamond 90.9%, #1 on the leaderboard, with pure greedy decoding
Darwin-398B-JGOS reaches 90.9% (180/198) on GPQA Diamond, the PhD-level scientific reasoning benchmark, ranking #1 on the Hugging Face GPQA Diamond leaderboard. No self-consistency, no test-time compute scaling — this was achieved with a single greedy decode (temperature 0, single sample, max 16,384 tokens). The full eval config is published in the model card, so anyone can reproduce it. Raw reasoning, no score inflation.
The result comes from Darwin V9, a patented evolutionary model-development platform. Its core idea: it never trains a model from scratch.
Why Darwin V9 beats training from scratch

Cost & speed: no trillion-token pretraining run, no months of compute — a purpose-built, high-performance model is produced in a fraction of the time.
Reuse of proven intelligence: instead of re-learning every capability from a blank slate, it selects and combines only the strengths of already-trained, already-validated models, so results are stable and predictable.
Surgical transplantation: it identifies which neural region of which model holds which capability — at the FFN (Feed Forward Network) layer level — and grafts in only the segments that contribute to the target skill.

How it works: a large model (Qwen 3.5 397B) serves as the mother model (the substrate); several father models specialized in reasoning, coding, and language are analyzed layer-by-layer across their FFN regions; the segments that contribute to the target performance are extracted and transplanted into the mother model to produce a new child model. The result is a ~400B MoE that activates only ~17B parameters per token at inference — large-model capacity with efficient inference.
If training from scratch means rebuilding everything from a blank page, Darwin V9 means precisely recombining intelligence that has already been proven. GPQA Diamond #1 is the proof.
Model: FINAL-Bench/Darwin-398B-JGOS

replied to their post 1 day ago

Got it. We can use this as a baseline to compare against QPU-1 and evaluate the performance differences.

reacted to their post with 🔥 1 day ago

Post

6083

🚀 Introducing FINAL-Bench Quantum — an open, neutral benchmark that finally puts quantum-computing methods on one fair yardstick.

Quantum results are notoriously hard to compare. The same "logical error rate" or "query fidelity" means very different things depending on the code, noise model, hardware, and shot count. FINAL-Bench Quantum fixes that: five events judged under identical, published protocols, where every number is labeled as either measured here or quoted from a source.

Five events: ① QEC Decoder ② Optimization (Max-Cut) ③ VQE ④ QRAM ⑤ Quantum Simulation

The rules are simple and strict:
✅ Track A (measured here, with 95% confidence intervals) is kept separate from Track B (quoted from papers, not directly comparable).
🔬 Simulation and real hardware are clearly distinguished, and no quantum-advantage claims are made.
🌍 Methods from Google, IBM, NVIDIA, USTC, Riverlane and more sit side by side, with origin flags and author credits.
📤 Anyone can submit their own method via the Submit tab for review and listing.

Already on the board: real IBM Heron r2 measurements (repetition-code distance boundary, 29–175× error reduction from d3 to d5), a real-chip QRAM query fidelity of 0.92, and H₂ VQE at chemical accuracy — always labeled honestly as simulation vs hardware.

A leaderboard is only useful if you can trust it, so neutrality is the whole point: strong competitors stay in even when they beat the host, sources are quoted faithfully, and a simulation is never rounded up into a hardware claim.

Leaderboard: FINAL-Bench/quantum-bench-leaderboard
Article: https://huggingface.co/blog/FINAL-Bench/quantum-leaderboard

#quantum #QEC #QuantumComputing #benchmark

2 replies

·

posted an update 1 day ago

Post

6083

🚀 Introducing FINAL-Bench Quantum — an open, neutral benchmark that finally puts quantum-computing methods on one fair yardstick.

Quantum results are notoriously hard to compare. The same "logical error rate" or "query fidelity" means very different things depending on the code, noise model, hardware, and shot count. FINAL-Bench Quantum fixes that: five events judged under identical, published protocols, where every number is labeled as either measured here or quoted from a source.

Five events: ① QEC Decoder ② Optimization (Max-Cut) ③ VQE ④ QRAM ⑤ Quantum Simulation

The rules are simple and strict:
✅ Track A (measured here, with 95% confidence intervals) is kept separate from Track B (quoted from papers, not directly comparable).
🔬 Simulation and real hardware are clearly distinguished, and no quantum-advantage claims are made.
🌍 Methods from Google, IBM, NVIDIA, USTC, Riverlane and more sit side by side, with origin flags and author credits.
📤 Anyone can submit their own method via the Submit tab for review and listing.

Already on the board: real IBM Heron r2 measurements (repetition-code distance boundary, 29–175× error reduction from d3 to d5), a real-chip QRAM query fidelity of 0.92, and H₂ VQE at chemical accuracy — always labeled honestly as simulation vs hardware.

A leaderboard is only useful if you can trust it, so neutrality is the whole point: strong competitors stay in even when they beat the host, sources are quoted faithfully, and a simulation is never rounded up into a hardware claim.

Leaderboard: FINAL-Bench/quantum-bench-leaderboard
Article: https://huggingface.co/blog/FINAL-Bench/quantum-leaderboard

#quantum #QEC #QuantumComputing #benchmark

2 replies

·

reacted to PhysiQuanty's post with 🔥 18 days ago

Post

4636

🌐 We crawled the entirety of Hugging Face to help the community! Huge thanks to the Hugging Face API 🌐
🤖 2.91M model repos (file names included), 📚 1.02M dataset repos, 🚀 1.31M Space repos
🤗 617,501 committers (datasets and models), we’ll share Hugging Face statistics with you in the coming days..

We also identified 61,398 users with “AI/ML Interests”, and NOW we can find each other through our “AI/ML Interests”🤗
HF-Collab-Center/Searching-For-HuggingFace-Users
HF-Collab-Center/All-Model-Repos
HF-Collab-Center/All-Dataset-Repos
HF-Collab-Center/All-Space-Repos

HF-Collab-Center/HF-Users
HF-Collab-Center/HF-Users-with-last-seen
HF-Collab-Center/HF-Users-With-AI-ML-Interests-Only

Made By @QuantaSparkLabs and @PhysiQuanty
C'est français, bon.. en anglais.. mais c'est français ;)

5 replies

·

reacted to their post with 🧠 19 days ago

Post

4245

Darwin-60B-DUO: Two SOTAs, One Endpoint — 88.38% on GPQA Diamond 🚀

We're excited to release Darwin-60B-DUO, the Darwin family's first DUO model. Take two domain-verified specialists, hide them behind a single OpenAI-compatible endpoint, and let a router decide which one (or both) answers. You see one model, one API — but get the best of both.

The number that matters: on the full 198-question GPQA Diamond, Darwin-60B-DUO hits 88.38%. The constituents alone land at 69.70% (Darwin-28B-REASON) and 77.27% (AWAXIS-Think-31B); a naive cascade only reaches 83.84%. The DUO clears them all. Two small specialists, intelligently routed, beat one big generalist on cost and quality. Both are independently verified — Darwin-28B-REASON is #3 on the HF GPQA Diamond leaderboard, AWAXIS-Think-31B is #1 on Korea's national K-AI Leaderboard (MSIT).

The brains is a Hybrid-A router picking one of five strategies on the fly. Korean → AWAXIS, English/STEM → Darwin (single-backend, ~70% of traffic at 1× cost). When a Korean answer needs rigorous English reasoning, split_refine fires — Darwin drafts, AWAXIS polishes; MCQ/short-answer runs both with self-consistency + cross-verify. Net effective cost: only ~1.3× a single 30B model.

The part the community will care about: the gateway is model-agnostic and Apache-2.0. Point it at any two OpenAI-compatible backends and you've got a DUO in minutes — teach router.py when to use which, and parallel calls, response merging, and routing transparency via _duo_route are handled for you. Fork it and tell us what you built.

Painless deploy: docker compose up for both vLLM backends + gateway; FP8 ~30GB colocates on a single B200/H100. One git clone (~120GB). Text-only for now, streaming in v1.1.
Two SOTAs, one endpoint. Come build your own on the Community tab.

👇
🔗 FINAL-Bench/Darwin-60B-DUO

posted an update 19 days ago

Post

4245

Darwin-60B-DUO: Two SOTAs, One Endpoint — 88.38% on GPQA Diamond 🚀

We're excited to release Darwin-60B-DUO, the Darwin family's first DUO model. Take two domain-verified specialists, hide them behind a single OpenAI-compatible endpoint, and let a router decide which one (or both) answers. You see one model, one API — but get the best of both.

The number that matters: on the full 198-question GPQA Diamond, Darwin-60B-DUO hits 88.38%. The constituents alone land at 69.70% (Darwin-28B-REASON) and 77.27% (AWAXIS-Think-31B); a naive cascade only reaches 83.84%. The DUO clears them all. Two small specialists, intelligently routed, beat one big generalist on cost and quality. Both are independently verified — Darwin-28B-REASON is #3 on the HF GPQA Diamond leaderboard, AWAXIS-Think-31B is #1 on Korea's national K-AI Leaderboard (MSIT).

The brains is a Hybrid-A router picking one of five strategies on the fly. Korean → AWAXIS, English/STEM → Darwin (single-backend, ~70% of traffic at 1× cost). When a Korean answer needs rigorous English reasoning, split_refine fires — Darwin drafts, AWAXIS polishes; MCQ/short-answer runs both with self-consistency + cross-verify. Net effective cost: only ~1.3× a single 30B model.

The part the community will care about: the gateway is model-agnostic and Apache-2.0. Point it at any two OpenAI-compatible backends and you've got a DUO in minutes — teach router.py when to use which, and parallel calls, response merging, and routing transparency via _duo_route are handled for you. Fork it and tell us what you built.

Painless deploy: docker compose up for both vLLM backends + gateway; FP8 ~30GB colocates on a single B200/H100. One git clone (~120GB). Text-only for now, streaming in v1.1.
Two SOTAs, one endpoint. Come build your own on the Community tab.

👇
🔗 FINAL-Bench/Darwin-60B-DUO

reacted to their post with 🔥 about 1 month ago

Post

5425

🧬 Darwin Family: Zero Gradient Steps, GPQA Diamond 88.89%

How far can we push LLM reasoning *without* training?

Our team at VIDRAFT submitted this paper to Daily Papers yesterday, and it's
currently #3. Huge thanks to everyone who upvoted — sharing the core ideas below.

🔗 Paper: Darwin Family: MRI-Trust-Weighted Evolutionary Merging for Training-Free Scaling of Language-Model Reasoning (2605.14386)
🔗 arXiv: https://arxiv.org/abs/2605.14386
🔗 Model: FINAL-Bench/Darwin-28B-REASON
🔗 Model: FINAL-Bench/Darwin-28B-Opus

---

TL;DR

Darwin Family is a training-free evolutionary merging framework.
By recombining the weight spaces of existing LLM checkpoints — with zero
gradient-based training — it reaches frontier-level reasoning.

- 🏆 Darwin-28B-Opus: GPQA Diamond 88.89%
- 💸 Zero gradient steps — not a single B200 or H200 hour needed
- 🧬 Consistent gains across 4B → 35B scale
- 🔀 Cross-architecture breeding between Transformer and Mamba families
- 🔁 Stable recursive multi-generation evolution

#Three Core Mechanisms

① 14-dim Adaptive Merge Genome — fine-grained recombination at both
component level (Attention / FFN / MLP / LayerNorm / Embedding) and block
level, expanding the prior evolutionary-merge search space.

② MRI-Trust Fusion — we diagnose each layer's reasoning contribution
via an **MRI (Model Reasoning Importance)** signal and fuse it with
evolutionary search through a **learnable trust parameter**. Trust the
diagnostic too much and search collapses; ignore it and search becomes
inefficient — Darwin learns the balance from data.

③ Architecture Mapper — weight-space breeding across heterogeneous
families. Attention × SSM crossover actually works.

Why It Matters
> Diagnose latent capabilities already encoded in open checkpoints,
> and recombine them — no gradients required.

Replies and critiques welcome 🙌

3 replies

·

replied to their post about 1 month ago

Thank you for the kind words! More to come. 🙏

reacted to their post with ❤️ about 1 month ago

Post

5425

🧬 Darwin Family: Zero Gradient Steps, GPQA Diamond 88.89%

How far can we push LLM reasoning *without* training?

Our team at VIDRAFT submitted this paper to Daily Papers yesterday, and it's
currently #3. Huge thanks to everyone who upvoted — sharing the core ideas below.

🔗 Paper: Darwin Family: MRI-Trust-Weighted Evolutionary Merging for Training-Free Scaling of Language-Model Reasoning (2605.14386)
🔗 arXiv: https://arxiv.org/abs/2605.14386
🔗 Model: FINAL-Bench/Darwin-28B-REASON
🔗 Model: FINAL-Bench/Darwin-28B-Opus

---

TL;DR

Darwin Family is a training-free evolutionary merging framework.
By recombining the weight spaces of existing LLM checkpoints — with zero
gradient-based training — it reaches frontier-level reasoning.

- 🏆 Darwin-28B-Opus: GPQA Diamond 88.89%
- 💸 Zero gradient steps — not a single B200 or H200 hour needed
- 🧬 Consistent gains across 4B → 35B scale
- 🔀 Cross-architecture breeding between Transformer and Mamba families
- 🔁 Stable recursive multi-generation evolution

#Three Core Mechanisms

① 14-dim Adaptive Merge Genome — fine-grained recombination at both
component level (Attention / FFN / MLP / LayerNorm / Embedding) and block
level, expanding the prior evolutionary-merge search space.

② MRI-Trust Fusion — we diagnose each layer's reasoning contribution
via an **MRI (Model Reasoning Importance)** signal and fuse it with
evolutionary search through a **learnable trust parameter**. Trust the
diagnostic too much and search collapses; ignore it and search becomes
inefficient — Darwin learns the balance from data.

③ Architecture Mapper — weight-space breeding across heterogeneous
families. Attention × SSM crossover actually works.

Why It Matters
> Diagnose latent capabilities already encoded in open checkpoints,
> and recombine them — no gradients required.

Replies and critiques welcome 🙌

3 replies

·

posted an update about 1 month ago

Post

5425

🧬 Darwin Family: Zero Gradient Steps, GPQA Diamond 88.89%

How far can we push LLM reasoning *without* training?

Our team at VIDRAFT submitted this paper to Daily Papers yesterday, and it's
currently #3. Huge thanks to everyone who upvoted — sharing the core ideas below.

🔗 Paper: Darwin Family: MRI-Trust-Weighted Evolutionary Merging for Training-Free Scaling of Language-Model Reasoning (2605.14386)
🔗 arXiv: https://arxiv.org/abs/2605.14386
🔗 Model: FINAL-Bench/Darwin-28B-REASON
🔗 Model: FINAL-Bench/Darwin-28B-Opus

---

TL;DR

Darwin Family is a training-free evolutionary merging framework.
By recombining the weight spaces of existing LLM checkpoints — with zero
gradient-based training — it reaches frontier-level reasoning.

- 🏆 Darwin-28B-Opus: GPQA Diamond 88.89%
- 💸 Zero gradient steps — not a single B200 or H200 hour needed
- 🧬 Consistent gains across 4B → 35B scale
- 🔀 Cross-architecture breeding between Transformer and Mamba families
- 🔁 Stable recursive multi-generation evolution

#Three Core Mechanisms

① 14-dim Adaptive Merge Genome — fine-grained recombination at both
component level (Attention / FFN / MLP / LayerNorm / Embedding) and block
level, expanding the prior evolutionary-merge search space.

② MRI-Trust Fusion — we diagnose each layer's reasoning contribution
via an **MRI (Model Reasoning Importance)** signal and fuse it with
evolutionary search through a **learnable trust parameter**. Trust the
diagnostic too much and search collapses; ignore it and search becomes
inefficient — Darwin learns the balance from data.

③ Architecture Mapper — weight-space breeding across heterogeneous
families. Attention × SSM crossover actually works.

Why It Matters
> Diagnose latent capabilities already encoded in open checkpoints,
> and recombine them — no gradients required.

Replies and critiques welcome 🙌

3 replies

·

reacted to their post with ❤️ about 2 months ago

Post

5098

🌌 Introducing Model Galaxy — a Living, Multimodal Fork of the HF Model Atlas

👉 Try it: FINAL-Bench/model-galaxy

This Space is a fork of the brilliant Eliahu/Model-Atlas, the official demo of "Charting and Navigating Hugging Face's Model Atlas" (Horwitz et al., arXiv 2503.10633). Their pre-computed HF model graph is the foundation of every node and edge you see, and we are deeply grateful for its open release.

The original atlas is a static snapshot of early 2025. Model Galaxy turns it into a living, multimodal map. We injected the 2026 trending originals that did not exist when the atlas was frozen — DeepSeek-V4, Hy3-preview, GLM-5.1, Kimi-K2, gpt-oss, Nemotron-3 Super / Nano / Omni, Hermes-4.3, Qwen3-Coder-Next, Llama-3.3, Granite-4.1, plus the latest multimodal releases (FLUX.2, ERNIE-Image, HunyuanImage / Video, LTX-2.3, Wan2.2, Kokoro-82M, VoxCPM2, Voxtral-TTS, whisper-v3-turbo, Gemma-4, Qwen3-Omni, Phi-4-mm) — each with proper base_model lineage edges.

We also added the complete VIDRAFT Darwin family ontology: 120 nodes covering Darwin Core, AETHER, every brand variant (Rogue, AWAXIS, TenOS, Warecube), NOESIS-Darwin multimodal extensions, and 40+ community quantizations — the most complete Darwin lineage view anywhere.

The name "Galaxy" is now literal: our three injected clusters are re-laid out as logarithmic spiral galaxies, with bigger models near the bright cores and quantizations scattering to the outer arms — just like real star mass distribution. A top-right toggle switches between Galaxy mode (deep-space gradient with 220 animated stars) and Atlas mode (clean white panels for reports). A 15-second progress bar narrates the render, and per-modality / per-company colors make every cluster legible at a glance.

Final scale: 22,480 nodes in the default Modalities atlas, 137,324 in the Large NLP atlas, and a 277-node compact Darwin + Trending view for instant exploration. Feedback and PRs welcome.

posted an update about 2 months ago

Post

5098

🌌 Introducing Model Galaxy — a Living, Multimodal Fork of the HF Model Atlas

👉 Try it: FINAL-Bench/model-galaxy

This Space is a fork of the brilliant Eliahu/Model-Atlas, the official demo of "Charting and Navigating Hugging Face's Model Atlas" (Horwitz et al., arXiv 2503.10633). Their pre-computed HF model graph is the foundation of every node and edge you see, and we are deeply grateful for its open release.

The original atlas is a static snapshot of early 2025. Model Galaxy turns it into a living, multimodal map. We injected the 2026 trending originals that did not exist when the atlas was frozen — DeepSeek-V4, Hy3-preview, GLM-5.1, Kimi-K2, gpt-oss, Nemotron-3 Super / Nano / Omni, Hermes-4.3, Qwen3-Coder-Next, Llama-3.3, Granite-4.1, plus the latest multimodal releases (FLUX.2, ERNIE-Image, HunyuanImage / Video, LTX-2.3, Wan2.2, Kokoro-82M, VoxCPM2, Voxtral-TTS, whisper-v3-turbo, Gemma-4, Qwen3-Omni, Phi-4-mm) — each with proper base_model lineage edges.

We also added the complete VIDRAFT Darwin family ontology: 120 nodes covering Darwin Core, AETHER, every brand variant (Rogue, AWAXIS, TenOS, Warecube), NOESIS-Darwin multimodal extensions, and 40+ community quantizations — the most complete Darwin lineage view anywhere.

The name "Galaxy" is now literal: our three injected clusters are re-laid out as logarithmic spiral galaxies, with bigger models near the bright cores and quantizations scattering to the outer arms — just like real star mass distribution. A top-right toggle switches between Galaxy mode (deep-space gradient with 220 animated stars) and Atlas mode (clean white panels for reports). A 15-second progress bar narrates the render, and per-modality / per-company colors make every cluster legible at a glance.

Final scale: 22,480 nodes in the default Modalities atlas, 137,324 in the Large NLP atlas, and a 277-node compact Darwin + Trending view for instant exploration. Feedback and PRs welcome.

reacted to their post with 🔥 about 2 months ago

Post

8776

🧬 Introducing Darwin-9B-NEG — the first model with Native Entropy Gating (NEG)

🔗 Try it now: FINAL-Bench/Darwin-9B-NEG
🔗 Q4 bit : FINAL-Bench/Darwin-9B-MFP4

We're thrilled to release Darwin-9B-NEG, a 9B-parameter reasoning model
that embeds an architecturally-internalised sense of self-confidence directly
into the transformer — our proprietary Native Entropy Gating (NEG) technology.

📊 GPQA Diamond (198 PhD-level questions):

▸ Baseline Darwin-9B (no NEG) → 51.01 %
▸ Pure NEG (greedy · 1× cost) → 63.64 % 🔥 +12.63 %p
▸ + Permutation (4× cost) → 76.26 %
▸ + Ensemble Refinement (~20×) → 84.34 % 🏆

With only 9 billion parameters and 1× inference cost, Pure NEG jumps
+12.63 %p over the same model without NEG. Going all-in with ensemble
refinement pushes it to 84.34 % — surpassing the published Qwen3.5-9B
leaderboard score (81.7 %) by +2.64 %p.

🔬 What makes NEG different from Multi-Turn Iteration (MTI)?

Classical MTI needs 3-8× extra inference passes. NEG instead lives
INSIDE the single decoding loop. Two tiny modules ride with the
transformer: NEG-Head predicts per-token entropy from the last hidden
state, and NEG-Gate conditionally restricts the top-k choice when
confidence is low. The gate activates in only 4.36 % of tokens —
essentially free at inference time.

✨ Key differentiators
• Architecturally internalised — model file *is* the feature
• 1× inference cost (vs. 3-8× for MTI)
• Drop-in with vLLM / SGLang / TGI / transformers — no extra engine
• +12.63 %p reasoning at zero latency overhead
• Single-file deployment, Apache 2.0 licensed

🧬 Lineage
Qwen/Qwen3.5-9B → Darwin-9B-Opus (V7 evolutionary merge) → Darwin-9B-NEG (V8 + NEG training)

#Darwin #NEG #NativeEntropyGating #GPQA #Reasoning #LLM #OpenSource #Apache2

posted an update about 2 months ago

Post

8776

🧬 Introducing Darwin-9B-NEG — the first model with Native Entropy Gating (NEG)

🔗 Try it now: FINAL-Bench/Darwin-9B-NEG
🔗 Q4 bit : FINAL-Bench/Darwin-9B-MFP4

We're thrilled to release Darwin-9B-NEG, a 9B-parameter reasoning model
that embeds an architecturally-internalised sense of self-confidence directly
into the transformer — our proprietary Native Entropy Gating (NEG) technology.

📊 GPQA Diamond (198 PhD-level questions):

▸ Baseline Darwin-9B (no NEG) → 51.01 %
▸ Pure NEG (greedy · 1× cost) → 63.64 % 🔥 +12.63 %p
▸ + Permutation (4× cost) → 76.26 %
▸ + Ensemble Refinement (~20×) → 84.34 % 🏆

With only 9 billion parameters and 1× inference cost, Pure NEG jumps
+12.63 %p over the same model without NEG. Going all-in with ensemble
refinement pushes it to 84.34 % — surpassing the published Qwen3.5-9B
leaderboard score (81.7 %) by +2.64 %p.

🔬 What makes NEG different from Multi-Turn Iteration (MTI)?

Classical MTI needs 3-8× extra inference passes. NEG instead lives
INSIDE the single decoding loop. Two tiny modules ride with the
transformer: NEG-Head predicts per-token entropy from the last hidden
state, and NEG-Gate conditionally restricts the top-k choice when
confidence is low. The gate activates in only 4.36 % of tokens —
essentially free at inference time.

✨ Key differentiators
• Architecturally internalised — model file *is* the feature
• 1× inference cost (vs. 3-8× for MTI)
• Drop-in with vLLM / SGLang / TGI / transformers — no extra engine
• +12.63 %p reasoning at zero latency overhead
• Single-file deployment, Apache 2.0 licensed

🧬 Lineage
Qwen/Qwen3.5-9B → Darwin-9B-Opus (V7 evolutionary merge) → Darwin-9B-NEG (V8 + NEG training)

#Darwin #NEG #NativeEntropyGating #GPQA #Reasoning #LLM #OpenSource #Apache2

reacted to their post with 🔥 2 months ago

Post

4473

Darwin-TTS: 3% of an LLM's Brain Makes TTS Speak with Emotion — Zero Training

We blended 3% of Qwen3-1.7B (LLM) FFN weights into Qwen3-TTS-1.7B's talker module. The result: emotionally enhanced speech synthesis — with zero training, zero data, and zero GPU hours.

Try the Demo: FINAL-Bench/Darwin-TTS-1.7B-Cross

Model Weights: FINAL-Bench/Darwin-TTS-1.7B-Cross

Full Research Article: https://huggingface.co/blog/FINAL-Bench/darwin-tts

Qwen3-1.7B (LLM) and Qwen3-TTS-1.7B's talker share 100% identical architecture — same hidden_size (2048), same layers (28), same heads (16). This enabled pure 1:1 weight blending across 84 FFN tensors with a single lerp operation. At 3% blend, emotion appears. At 5%, emotion intensifies. At 10%, the model breaks — producing 655-second outputs for a 3-second sentence, because the LLM's "keep generating" pattern overwhelms the TTS stop signal.

To our knowledge, this is the first training-free cross-modal weight transfer between an LLM and a TTS model. Prior work either requires adapter training (SmolTolk, 2025), fine-tuning (CSLM, 2025), or massive end-to-end compute (GPT-4o). Darwin-TTS achieves cross-modal capability transfer in under 2 minutes on CPU.

The key insight: TTS models with LLM backbones already "think" in language. We're just restoring 3% of the original LLM's language understanding patterns — particularly those related to emotional semantics and prosody planning. The code is three lines: load the model, load the LLM FFN, call p.lerp_(llm_weight, 0.03).

creators of the Darwin Evolutionary Merge Framework.
Darwin LLM V7 achieved GPQA Diamond 86.9% (HF Benchmark #3)
through CMA-ES optimized FFN crossbreeding. Darwin-TTS extends this principle from LLM-to-LLM merging into cross-modal LLM-to-TTS transfer. Apache 2.0.

posted an update 2 months ago

Post

4473

Darwin-TTS: 3% of an LLM's Brain Makes TTS Speak with Emotion — Zero Training

We blended 3% of Qwen3-1.7B (LLM) FFN weights into Qwen3-TTS-1.7B's talker module. The result: emotionally enhanced speech synthesis — with zero training, zero data, and zero GPU hours.

Try the Demo: FINAL-Bench/Darwin-TTS-1.7B-Cross

Model Weights: FINAL-Bench/Darwin-TTS-1.7B-Cross

Full Research Article: https://huggingface.co/blog/FINAL-Bench/darwin-tts

Qwen3-1.7B (LLM) and Qwen3-TTS-1.7B's talker share 100% identical architecture — same hidden_size (2048), same layers (28), same heads (16). This enabled pure 1:1 weight blending across 84 FFN tensors with a single lerp operation. At 3% blend, emotion appears. At 5%, emotion intensifies. At 10%, the model breaks — producing 655-second outputs for a 3-second sentence, because the LLM's "keep generating" pattern overwhelms the TTS stop signal.

To our knowledge, this is the first training-free cross-modal weight transfer between an LLM and a TTS model. Prior work either requires adapter training (SmolTolk, 2025), fine-tuning (CSLM, 2025), or massive end-to-end compute (GPT-4o). Darwin-TTS achieves cross-modal capability transfer in under 2 minutes on CPU.

The key insight: TTS models with LLM backbones already "think" in language. We're just restoring 3% of the original LLM's language understanding patterns — particularly those related to emotional semantics and prosody planning. The code is three lines: load the model, load the LLM FFN, call p.lerp_(llm_weight, 0.03).

creators of the Darwin Evolutionary Merge Framework.
Darwin LLM V7 achieved GPQA Diamond 86.9% (HF Benchmark #3)
through CMA-ES optimized FFN crossbreeding. Darwin-TTS extends this principle from LLM-to-LLM merging into cross-modal LLM-to-TTS transfer. Apache 2.0.

reacted to their post with 🔥 2 months ago

Post

5958

🧬 Darwin-27B-Opus: 86.9% on GPQA Diamond — World #5, Zero Training
We are excited to share Darwin-27B-Opus, a 27B model that achieved 86.9% on GPQA Diamond — ranking #5 globally on the HuggingFace leaderboard — without a single gradient update.

How? Darwin breeds pretrained models through evolutionary FFN crossbreeding. The father (Qwen3.5-27B) provides the reasoning architecture; the mother (Claude 4.6 Opus Reasoning Distilled) contributes structured chain-of-thought knowledge. CMA-ES automatically discovers optimal per-layer blending ratios — no human tuning required.

The result surpasses the original Qwen3.5-27B (85.5%), GLM-5.1 (744B, 86.2%), and Qwen3.5-122B (86.6%). A 27B model outperforming 744B — with zero training, zero data, one GPU, ~2 hours.

We also confirmed hybrid vigor on Korean benchmarks: Darwin-27B-KR (2nd generation offspring) surpassed both parents on CLIcK, winning 7 out of 11 categories. The evolutionary optimizer independently assigned 93% of FFN from the Korean-specialized mother while preserving 93% of attention from the reasoning-specialized father — autonomously validating our core principle: FFN carries knowledge, Attention carries reasoning.

📊 Public release: 10 days → 300+ community derivatives, 120K+ downloads.

🔗 Links:
Darwin-27B-Opus: FINAL-Bench/Darwin-27B-Opus
article: https://huggingface.co/blog/FINAL-Bench/darwin-gpqa
Darwin Family Collection: https://huggingface.co/collections/FINAL-Bench/darwin-family

If foundation models are raw ore, Darwin is the forge. We are just getting started. 🔥

posted an update 2 months ago

Post

5958

🧬 Darwin-27B-Opus: 86.9% on GPQA Diamond — World #5, Zero Training
We are excited to share Darwin-27B-Opus, a 27B model that achieved 86.9% on GPQA Diamond — ranking #5 globally on the HuggingFace leaderboard — without a single gradient update.

How? Darwin breeds pretrained models through evolutionary FFN crossbreeding. The father (Qwen3.5-27B) provides the reasoning architecture; the mother (Claude 4.6 Opus Reasoning Distilled) contributes structured chain-of-thought knowledge. CMA-ES automatically discovers optimal per-layer blending ratios — no human tuning required.

The result surpasses the original Qwen3.5-27B (85.5%), GLM-5.1 (744B, 86.2%), and Qwen3.5-122B (86.6%). A 27B model outperforming 744B — with zero training, zero data, one GPU, ~2 hours.

We also confirmed hybrid vigor on Korean benchmarks: Darwin-27B-KR (2nd generation offspring) surpassed both parents on CLIcK, winning 7 out of 11 categories. The evolutionary optimizer independently assigned 93% of FFN from the Korean-specialized mother while preserving 93% of attention from the reasoning-specialized father — autonomously validating our core principle: FFN carries knowledge, Attention carries reasoning.

📊 Public release: 10 days → 300+ community derivatives, 120K+ downloads.

🔗 Links:
Darwin-27B-Opus: FINAL-Bench/Darwin-27B-Opus
article: https://huggingface.co/blog/FINAL-Bench/darwin-gpqa
Darwin Family Collection: https://huggingface.co/collections/FINAL-Bench/darwin-family

If foundation models are raw ore, Darwin is the forge. We are just getting started. 🔥

VIDRAFT_LAB

AI & ML interests

Recent Activity

Organizations

VIDRAFT_LAB

AI & ML interests

Recent Activity

Organizations

SeaWolf-AI's activity