
MicroExperts/NG Architecture

Split or die trying.

Self-organizing 1-bit mixture-of-experts for continual learning without catastrophic forgetting.


Warning: this is not a paper. It is an architecture outline that is still not 100% concrete (I will probably scrap the BitNet part), plus a more-or-less dev diary at the bottom. I have not yet done enough testing to prove that it fully works as intended.

Warning 2: the text is partly AI-written (Claude Opus and Gemini Pro); I rewrote the most cringy sentences and fixed most errors.

Plan

01 · The Problem

Standard neural networks store knowledge in shared parameters. Training on new data overwrites weights encoding old knowledge: catastrophic forgetting.

This was first identified by McCloskey & Cohen, who showed that a backpropagation network trained on ones addition facts completely lost that knowledge when retrained on twos [1]. Ratcliff established that the root cause is representational overlap at hidden layers: when many shared weights change, prior knowledge cannot survive [2]. French's comprehensive review concluded that dual memory systems separating short-term and long-term storage were necessary to overcome it [3].

Existing solutions all have fundamental limits: EWC/SI accumulate penalties until the model can't learn anything new. Replay buffers require storing data forever. Progressive networks grow linearly without sharing. Masking fixes capacity at initialization and can't grow. In all of them, the structure never self-organizes; it has to be designed.

MicroExperts' response: New knowledge gets new parameters via expert splits. Old knowledge lives in parameters that receive no gradients unless relevant. Protection is structural. And the system grows its own capacity to match data complexity.


02 · Architecture

A transformer where the FFN in each block is replaced by a dynamic Mixture-of-Experts layer with ultra-small 1.58-bit quantized experts.

Tokens → Embedding → Attention → Adaptive Router → Expert Pool → Weighted Sum → LM Head

The experts are small feedforward networks built from BitLinear layers, 1.58-bit quantized linear layers borrowed from Microsoft's BitNet research [15]. Weights are binarized to {-1, 0, +1} using round() with mean absolute value scaling. Activations are quantized to 8-bit via absmax. The straight-through estimator enables gradient flow.
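The quantization scheme can be sketched in NumPy (a minimal illustration of the described recipe; the actual BitNet kernels and the MLX implementation differ):

```python
import numpy as np

def quantize_weights_ternary(w):
    """Absmean ternary quantization in the style of BitNet b1.58:
    scale by the mean absolute weight, round, clip to {-1, 0, +1}."""
    scale = np.mean(np.abs(w)) + 1e-8
    return np.clip(np.round(w / scale), -1, 1), scale

def quantize_activations_absmax(x, bits=8):
    """Per-tensor absmax activation quantization to signed `bits` bits."""
    qmax = 2 ** (bits - 1) - 1          # 127 for 8 bits
    scale = np.max(np.abs(x)) / qmax + 1e-8
    return np.clip(np.round(x / scale), -qmax, qmax), scale

# During training, the straight-through estimator uses the quantized values
# in the forward pass while gradients flow to the latent full-precision
# weights, conceptually: w_q = w + stop_gradient(quant(w) - w).
w = np.random.randn(256, 512).astype(np.float32)
qw, s = quantize_weights_ternary(w)
```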

Kawata et al. proved theoretically that vanilla networks fail to detect latent organizational structure in data; they process the problem as a whole. MoE succeeds by dividing it into easier subproblems [9]. Li et al. provided the first theoretical proof that MoE can diversify experts and prevent forgetting in continual learning [8].

Expert Size Tiers

Tier | Hidden Dim | Params | Memory (1-bit) | Role
0 | 512 | ~1M | ~125 KB | Narrow specialists
1 | 1,024 | ~4M | ~500 KB | Domain experts
2 | 2,048 | ~16M | ~2 MB | Broad generalists
3 | 4,096 | ~64M | ~8 MB | Monolith / max capacity

Powers-of-4 sizing ensures clean merge arithmetic. Fixed tier sizes also eliminate GPU memory fragmentation: pre-allocated slab pools per tier recycle blocks on death and grab from the pool on birth. Expert tiers can be increased later.
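A slab pool of the kind described can be sketched as follows (the class and its details are an assumption for illustration, not the actual implementation):

```python
import numpy as np

class SlabPool:
    """Per-tier slab pool sketch: fixed-size weight buffers are recycled
    on expert death and reused on expert birth, so the allocator never
    fragments."""
    def __init__(self, slab_shape):
        self.slab_shape = slab_shape
        self.free = []  # recycled buffers waiting for reuse

    def acquire(self):
        # Birth: reuse a recycled slab if available, else allocate one.
        return self.free.pop() if self.free else np.zeros(self.slab_shape, dtype=np.float32)

    def release(self, buf):
        # Death: return the slab to the pool instead of freeing it.
        self.free.append(buf)

# One pool per tier; shapes here just mirror the hidden dims in the table.
pools = {tier: SlabPool((512 * 2**tier, 512)) for tier in range(4)}
a = pools[1].acquire()
pools[1].release(a)
b = pools[1].acquire()  # the same physical buffer comes back
```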

Critical property: Inactive experts receive zero gradients. Their weights are structurally untouched. Physical parameter isolation, not regularization or replay, is what forgetting resistance is built on.


03 · The Router

Routes inputs to experts, produces clean signals for lifecycle decisions, and adapts as experts appear and disappear.

Embedding-Similarity (Not Linear)

Each expert has a learned embedding vector in routing space. Routing is cosine similarity between the input (projected through a routing head MLP) and expert embeddings. Adding an expert means adding one vector. Removing means deleting one. No weight matrix resizing.

Adaptive Threshold (Not Top-K)

Fixed top-k is wrong for this architecture. The system aims for the best of both worlds: the interconnectivity of dense models and the domain-specific specialization of MoE models.

Simple inputs might activate 1 expert. Complex cross-domain inputs might activate 50. The density itself is a diagnostic signal; spikes indicate distribution shift.
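Putting the two routing ideas together, a minimal sketch (the threshold value and the fallback to the single best match are hypothetical choices):

```python
import numpy as np

def route(h, expert_embs, threshold=0.3):
    """Embedding-similarity routing with an adaptive threshold: every
    expert whose cosine similarity to the projected input exceeds the
    threshold activates, so the active set size varies per input
    instead of being fixed at top-k."""
    h = h / (np.linalg.norm(h) + 1e-8)
    e = expert_embs / (np.linalg.norm(expert_embs, axis=1, keepdims=True) + 1e-8)
    sims = e @ h                               # cosine similarity per expert
    active = np.flatnonzero(sims > threshold)  # density = len(active), a diagnostic
    if active.size == 0:                       # fall back to the best match
        active = np.array([np.argmax(sims)])
    logits = sims[active]
    weights = np.exp(logits) / np.exp(logits).sum()
    return active, weights

embs = np.random.randn(8, 16)
# threshold below -1 activates every expert, for demonstration
active, weights = route(np.random.randn(16), embs, threshold=-2.0)
```

Adding an expert appends one row to `expert_embs`; removing one deletes a row. No weight matrix ever resizes.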

Chen et al. demonstrated that partitioning features into task-specific and shared components processed by dedicated expert groups effectively mitigates gradient conflicts at their source [11]. The adaptive router achieves this dynamically.

Lifecycle Hooks

Event | Embedding Action | Rationale
Split | Preserver: exact copy. Adapter: copy + noise. | Preserver handles same inputs; adapter diverges
Merge | Child: average of parents | Covers both parents' input space
Death | Embedding removed | Expert exits routing space

Hierarchical Routing (100K Scale)

At 100K experts, two-stage routing: first route to ~316 cluster centroids (√100K), then to experts within selected clusters. Cost: O(√N) instead of O(N). The gradient conflict-driven topology pruning paper notes that memory overhead can be mitigated by sparse conflict sampling [12].
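A sketch of the two-stage lookup under these assumptions:

```python
import numpy as np

def two_stage_route(h, centroids, cluster_of, expert_embs, top_clusters=2):
    """Hierarchical routing sketch: score ~sqrt(N) cluster centroids
    first, then score only the experts inside the selected clusters,
    giving O(sqrt(N)) comparisons instead of O(N)."""
    c_scores = centroids @ h                                 # stage 1: centroids
    chosen = np.argsort(c_scores)[-top_clusters:]
    candidates = np.flatnonzero(np.isin(cluster_of, chosen))
    e_scores = expert_embs[candidates] @ h                   # stage 2: candidates only
    return candidates, e_scores

rng = np.random.default_rng(0)
n_experts, n_clusters, d = 100, 10, 16
expert_embs = rng.standard_normal((n_experts, d))
cluster_of = np.arange(n_experts) % n_clusters              # 10 experts per cluster
centroids = np.stack([expert_embs[cluster_of == c].mean(axis=0)
                      for c in range(n_clusters)])
cand, scores = two_stage_route(rng.standard_normal(d), centroids,
                               cluster_of, expert_embs)
```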


04 · The Cannibalization Signal

The entire system is driven by a single signal: interference between old and new knowledge.

Yu et al. (PCGrad) established that gradient conflict can be detected via negative cosine similarity between task gradients, and identified three destructive conditions: conflicting directions, high curvature, and large magnitude differences [4].

Yang et al. extended this to per-expert, per-token conflict within MoE, computing token-level gradients for each expert and identifying tokens whose gradients conflict with the expert's average optimization direction [5]. This is the closest existing work to MicroExperts' cannibalization signal. The difference: they reassign conflicting tokens (reactive routing); MicroExperts splits the expert (reactive structure).

GCond showed that raw per-batch conflict detection is too noisy; PCGrad gets stuck oscillating. The solution is EMA smoothing with tiered conflict zones [6]. MicroExperts uses the same approach: dual exponential moving averages (fast and slow) tracking each expert's loss. When fast diverges upward from slow, that expert is being cannibalized.
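A minimal dual-EMA tracker illustrating the idea (the coefficients and trigger threshold here are hypothetical):

```python
class CannibalizationDetector:
    """Dual-EMA loss tracker sketch: a fast EMA diverging upward from a
    slow EMA flags the expert as being cannibalized."""
    def __init__(self, fast=0.1, slow=0.01, trigger=0.2):
        self.fast_a, self.slow_a, self.trigger = fast, slow, trigger
        self.fast = self.slow = None

    def update(self, loss):
        if self.fast is None:
            self.fast = self.slow = loss
        self.fast += self.fast_a * (loss - self.fast)
        self.slow += self.slow_a * (loss - self.slow)
        # Relative divergence of fast above slow is the signal.
        return (self.fast - self.slow) / (abs(self.slow) + 1e-8) > self.trigger

det = CannibalizationDetector()
stable = [det.update(1.0) for _ in range(100)]  # flat loss: never fires
spiked = [det.update(3.0) for _ in range(20)]   # sustained rise: fires
```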

Borsani et al. further showed that magnitude similarity matters alongside directional conflict [7], suggesting the signal could be enriched further.

The measurement: Per-expert interference = L2 distance between the expert's individual output and the combined mixture output, normalized by the expert's output magnitude. High interference = other experts are pulling the result away from what this expert "wants." This is the cannibalization score.
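A direct transcription of this definition (the epsilon is an assumed numerical guard):

```python
import numpy as np

def cannibalization_score(expert_out, mixture_out):
    """Per-expert interference: L2 distance between the expert's own
    output and the combined mixture output, normalized by the expert's
    output magnitude."""
    return np.linalg.norm(mixture_out - expert_out) / (np.linalg.norm(expert_out) + 1e-8)

same = cannibalization_score(np.array([1.0, 0.0]), np.array([1.0, 0.0]))
pulled = cannibalization_score(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```

When the mixture agrees with the expert the score is zero; the further other experts pull the result away, the higher it climbs.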


05 · Self-Organization: Split / Merge / Death

The system starts as a single monolith and differentiates through pure training pressure. Structure emerges from self-preservation dynamics.

Split: Self-Preservation (Same Tier)

When cannibalization exceeds threshold, the expert splits into two same-tier children. The preserver inherits exact weights with gradients frozen, for a duration proportional to expert importance. The adapter inherits weights with perturbation and absorbs new gradient pressure. This maps to French's dual-memory insight: preserver = long-term memory (stable), adapter = short-term memory (plastic) [3].
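A sketch of the split under the stated rules (the freeze schedule, noise scale, and expert representation are hypothetical):

```python
import copy
import numpy as np

def split_expert(expert, importance, noise_scale=0.01, freeze_per_unit=100):
    """Preserver/adapter split sketch. Preserver: exact weight copy,
    frozen for a period proportional to importance. Adapter: perturbed
    copy that stays plastic and absorbs the new gradient pressure."""
    preserver = copy.deepcopy(expert)
    preserver["frozen_steps"] = int(importance * freeze_per_unit)

    adapter = copy.deepcopy(expert)
    adapter["w"] = adapter["w"] + noise_scale * np.random.randn(*adapter["w"].shape)
    adapter["frozen_steps"] = 0
    return preserver, adapter

parent = {"w": np.ones((4, 4)), "frozen_steps": 0}
preserver, adapter = split_expert(parent, importance=2.0)
```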

SETA validated this shared/unique separation principle: overlapping parameters become shared experts (stabilized), non-overlapping become unique experts (frozen) [10]. MicroExperts achieves the same dynamically through the split mechanism.

Same-tier splits grow parameters: One tier-2 (16M) becomes two tier-2s (32M total). The system genuinely grows capacity. Expert count increases by 1, total params increase by one expert's worth.

Merge: Consolidation (Tier Up)

Three forces counterbalance splitting:

Merge Force | Signal | Effect
Fragment | Two experts individually weak but co-route | Recombines over-split debris → tier+1
Capacity | Pool approaching memory budget | Back-pressure against unbounded growth
Tier Gravity | Small same-tier experts co-activate | Consolidates upward: 2×T0 → 1×T1

Tier-up merges grow capacity: Two tier-0s (2M total) merge into one tier-1 (4M). Net gain: 2M params. Every merge-up cycle adds parameters. The system grows through the split→specialize→merge cycle, self-regulated by data complexity.
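The merge arithmetic can be checked directly against the tier table:

```python
# Approximate tier parameter counts from the tier table above.
TIER_PARAMS = {0: 1_000_000, 1: 4_000_000, 2: 16_000_000, 3: 64_000_000}

def tier_up_merge_gain(tier):
    """Net parameter change when two tier-t experts merge into one
    tier-(t+1) expert."""
    return TIER_PARAMS[tier + 1] - 2 * TIER_PARAMS[tier]

# Two tier-0s (2M total) -> one tier-1 (4M): net gain 2M, as stated.
gain = tier_up_merge_gain(0)
```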

Death: Controlled Forgetting

Experts with near-zero routing weight for extended periods are removed. Not all knowledge needs to persist. Death frees capacity.

No Spontaneous Birth

Experts are only born via splits. Every expert traces lineage to the original monolith. No random initialization ever. Novel data is absorbed by first cannibalizing the nearest existing expert, which then splits to protect itself.

The gradient conflict-driven topology pruning paper explicitly describes this concept as a future research direction: "We believe that grounding neural architecture search in physical gradient dynamics represents a promising step toward interpretable and self-organizing artificial intelligence." [12].


06 · Continual Learning

Like Conway's Game of Life [16], the system's complexity emerges from simple local rules. When new data arrives, the system protects itself automatically through a six-phase cycle.

Phase | System State | Response
1. Stability | Equilibrium, low cannibalization | Normal operation
2. New data | Router sends to nearest experts; conflicting gradients build | Drift detector notices entropy spike; thresholds tighten
3. Self-preservation | Cannibalized experts split (same tier); preservers freeze | Expert count grows; old knowledge isolated
4. Specialization | Adapters learn new domain; router differentiates | Density may spike temporarily
5. Consolidation | Redundant experts merge (tier up); fragments recombine | Count decreases; total params increase
6. New equilibrium | System stable at higher capacity | Old + new knowledge coexist

Flesch et al. showed that both humans and networks face the same fundamental tradeoff: "lumpers" who reuse representations get better transfer but worse interference, while "splitters" keep them separate to avoid interference at the cost of transfer [13]. MicroExperts navigates this dynamically: shared high-tier experts are lumpers, specialized low-tier experts are splitters.

The coordinated eligibility theory decomposes interference into receptive field and population response factors, showing that plasticity rules can protect against catastrophic interference without requiring gradient alignment with task objectives [14]. This maps to MicroExperts' design: experts are population responses, the router gates receptive fields.

Training implications: Data can be trained sequentially (each transition drives natural splits). Small diverse datasets work well (diversity, not volume, drives differentiation).


07 · Long-Term Structural Evolution

Over extended training, the system should theoretically develop a knowledge hierarchy, without anyone designing it:

Layer | Tier | Content | Formation
Universal | 2–3 | Punctuation, numbers, common patterns | Small experts merged upward via tier gravity
Cross-domain | 1–2 | Shared grammar, Latin roots | Domain experts found redundant; merged
Domain-specific | 0–1 | French conjugation, Python syntax | Split from shared experts when new data caused cannibalization
Exceptions | 0 | Irregular verbs, idioms, rare patterns | Tiny specialists

The total parameter count self-regulates to match accumulated knowledge complexity: more diverse data → more splits and merges → more growth.


08 · Known Risks

Risk | Mitigation
Cannibalization signal too noisy | Dual-EMA smoothing validated by GCond [6]; cooldown timer; minimum age
Merge collapse | Still no solution; I want to avoid a replay buffer
Router instability | Embedding continuity on split; cooldown between events
Expert starvation at 100K | Death mechanism; pressure system
Split/merge oscillation | Minimum age before merge; hysteresis; cooldown
1-bit experts too small | Scale up the expert tiers

Real Implementation

I wanted to split this document, whatever it is, into Plan and Real Implementation.

The current implementation runs on a Mac M4 Pro with 48 GB RAM. That's why I decided to implement it in MLX and not via BitNet; I don't see the BitNet implementation as a priority and may remove it completely. I am still not sure.

Day 1

Splitting does work. At first glance there is no weird behaviour like a cascade or a repeating pattern, so I think the implementation works. Don't get me wrong, it's still not proven, but I see it as a first success.

Day 2

I currently have the problem of preserving the optimizer state after splitting. Currently, I just reset the optimizer state, but then it loses momentum. I am still thinking of a sophisticated solution. First death of an expert: [step 4550][L5] DEATH 6adccc9b (T2, age=254, w=0.0010) RIP. Fixing the optimizer state by copying the parent state over to both children doesn't work for now; I will wipe them.

Day 3

I implemented checkpoint saving the wrong way, so it only saved the backbone, not the experts, so I have to train from the beginning again.

Day 4

After interference testing, I had to reduce the overall size again. I underestimated the time it would take with the compute graph recompiling so frequently.

Day 5

Now, with reduced expert and hidden dimension per layer, it doesn't split anymore.

Day 6

Split, merge, and death are working. Small-scale models are working too, but the lifetime has to be lowered so it works as intended, so the lifetime and size should depend on each other. Maybe I'll try to find some sort of formula in the future.

Day 7

For now I removed copying over the optimizer state because it caused a crash. I will probably reimplement it later because it's not that important; currently the model has to build up momentum from scratch, which sucks. Everything else seems to work. I analysed the log: there is some oscillation, basically split and merge back, but it's not the norm, and I expect oscillation to some degree. 80% of splits stick and don't merge back with the sibling.

Day 8

A report by Claude based on the logs:

12 monoliths → ~50 experts, 160M params. All lifecycle events fire: 89 splits, 36 merges, 13 deaths, drift detection. No crashes. Loss hit 4.26 at step 970, rose to ~5.0 during rapid growth (optimizer wipes), recovering to ~4.9 by step 10K. Of the 36 merges, 16 were sibling merge-backs (both children from the same parent reuniting) and 20 were non-sibling merges (unrelated weak experts consolidating). 73 out of 89 splits stuck — an 82% retention rate. Tier-gravity merge at step 10,920. L5 routes to multiple experts (density 1.5); other layers mostly top-1. Throughput: 4K → 1.6K tok/s.

The current biggest problem is the optimizer state wipe that keeps the model from building up momentum: after every split it wipes the optimizer state, and copying somehow corrupts the optimizer state, which is an annoying bug.

Day 9

The optimizer is now working: the optimizer state is successfully copied over to the children, so the base architecture is now working.

These are the results of training on top of Gutenberg with teknium/OpenHermes-2.5:

Loss 7.8→2.5. 6 splits, 5 merges, 3 deaths, 0 crashes. 4/6 splits stuck no merge back. L4: 3 splits, 2 merges, density 1.8. L5: 1 merge. L9: 2 splits, 2 deaths, density 1.5. L10: 2 splits, 2 merges, 1 death. L0–L3, L6–L8, L11: 0 events.

The result of this chat fine-tune is bad, but it has nothing to do with the model itself; it has more to do with the fact that I built myself a trash tokenizer that doesn't support special tokens. I will retry it at a later point; for now, it's here for completeness only.

Here is a short report with 10 test prompts: 4 monoliths (L4, L5, L8, L9). 3 near-monoliths trending stable (L0, L1, L2). 4 dynamic with per-prompt routing shifts (L3, L6, L7, L11). L10 borderline.

References

[1] McCloskey, M. & Cohen, N.J. (1989). Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation, 24, 109–165. https://www.andywills.info/hbab/mccloskeycohen.pdf

[2] Ratcliff, R. (1990). Connectionist models of recognition memory: Constraints imposed by learning and forgetting functions. Psychological Review, 97(2), 285–308. https://bpb-us-w2.wpmucdn.com/u.osu.edu/dist/6/60429/files/2018/07/psychrev90a-1jt2c34.pdf

[3] French, R.M. (1999). Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4), 128–135. https://lead.ube.fr/wp-content/uploads/2023/09/000282-catastrophic-forgetting-in-connectionist-networks.pdf

[4] Yu, T., Kumar, S., Gupta, A., Levine, S., Hausman, K. & Finn, C. (2020). Gradient Surgery for Multi-Task Learning. NeurIPS 2020. https://arxiv.org/pdf/2001.06782

[5] Yang, L. et al. (2025). Solving Token Gradient Conflict in Mixture-of-Experts for Large Vision-Language Model. ICLR 2025. https://arxiv.org/abs/2406.19905

[6] GCond. (2025). Gradient Conflict Resolution via Accumulation-based Stabilization for Large-Scale Multi-Task Learning. arXiv:2509.07252. https://arxiv.org/abs/2509.07252

[7] Borsani, T., Rosani, A., Nicosia, G. & Di Fatta, G. (2025). Gradient Similarity Surgery in Multi-Task Deep Learning. arXiv:2506.06130. Accepted at ECML PKDD 2025. https://arxiv.org/abs/2506.06130

[8] Li, H. et al. (2024). Theory on Mixture-of-Experts in Continual Learning. arXiv:2406.16437. https://arxiv.org/abs/2406.16437

[9] Kawata, R. et al. (2025). Mixture of Experts Provably Detect and Learn the Latent Cluster Structure in Gradient-Based Learning. arXiv:2506.01656. https://arxiv.org/abs/2506.01656

[10] Siddika, F. et al. (2026). Split-on-Share: Mixture of Sparse Experts for Task-Agnostic Continual Learning (SETA). arXiv:2601.17616. https://arxiv.org/abs/2601.17616

[11] Chen, J. et al. (2024). Mitigating Gradient Conflicts via Expert Squads in Multi-Task Learning. Neurocomputing, 128832. https://github.com/chenjie04/Multi-Task-Learning-PyTorch, https://www.sciencedirect.com/science/article/abs/pii/S0925231224016035

[12] Anonymous. (2025). MoE with Gradient Conflict-Driven Subspace Topology Pruning for Emergent Modularity. arXiv:2512.20291. https://arxiv.org/abs/2512.20291

[13] Flesch, T. et al. (2025). Humans and neural networks show similar patterns of transfer and interference during continual learning. Nature Human Behaviour. https://www.nature.com/articles/s41562-025-02318-y

[14] eLife. (2024). Beyond Gradients: Factorized, Geometric Control of Interference and Generalization. eLife 103701. https://elifesciences.org/reviewed-preprints/103701

[15] Ma, S. et al. (2024). The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits (BitNet b1.58). Microsoft Research, arXiv:2402.17764. https://arxiv.org/abs/2402.17764

[16] Conway's Game of Life. https://noweyr.github.io, https://conwaylife.com/wiki/
