CDM-Code-37M

Competitive Docking Memory, Code Model: a 37M-parameter language model trained on 200M tokens of Python code (codeparrot/codeparrot-clean-train).

This model spontaneously develops hierarchical scope registers (depth-stratified memory routing that mirrors Python's syntactic nesting structure) without any structural supervision. The model was trained only on next-token prediction.


Key Finding: Emergent Scope Registers

Without any explicit supervision about code structure, AST depth, or indentation, CDM develops routing behavior that mirrors syntactic nesting:

| Nesting depth | Routed to slots |
| --- | --- |
| Depth 0 (class/module declarations) | Slots 8, 15, 13 |
| Depth 1 (method definitions) | Distributed |
| Depth 2+ (deep nested code) | Slots 7, 3, 5 |

MI(slot assignment; indentation depth) = 0.1467 at training completion (step 30k). This is confirmed by two independent methods:

  1. JSON probe: full routing distributions across the full dataset → MI_ratio = 0.1467
  2. Gate-routing probe: single-sample argmax → MI = 0.281 bits
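
As a rough illustration of the quantity reported above, the sketch below computes a plug-in estimate of mutual information (in bits) between per-token slot assignments and bucketed indentation depth. The function names, the toy data, and the normalization used for the ratio (MI divided by the entropy of the depth labels) are assumptions for illustration only; the model card does not state how MI_ratio is normalized or how the probes are implemented.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a probability vector (zero entries are skipped)."""
    p = np.asarray(p, dtype=np.float64)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def mutual_information_bits(slots, depths, n_slots, n_depths):
    """Plug-in MI estimate between two integer label arrays of equal length."""
    joint = np.zeros((n_slots, n_depths), dtype=np.float64)
    np.add.at(joint, (slots, depths), 1.0)  # joint count table
    joint /= joint.sum()                     # p(slot, depth)
    p_slot = joint.sum(axis=1)
    p_depth = joint.sum(axis=0)
    # I(X; Y) = H(X) + H(Y) - H(X, Y)
    return entropy_bits(p_slot) + entropy_bits(p_depth) - entropy_bits(joint.ravel())

def mi_ratio(slots, depths, n_slots, n_depths):
    """ASSUMPTION: ratio = MI / H(depth); the source does not define it."""
    p_depth = np.bincount(depths, minlength=n_depths) / len(depths)
    h_depth = entropy_bits(p_depth)
    if h_depth == 0.0:
        return 0.0
    return mutual_information_bits(slots, depths, n_slots, n_depths) / h_depth

# Toy data (hypothetical): depth buckets 0, 1, 2+ and K=16 slots.
depths = np.array([0, 0, 1, 1, 2, 2, 2, 0])
slots = np.array([8, 15, 4, 6, 7, 3, 5, 13])
print(mutual_information_bits(slots, depths, n_slots=16, n_depths=3))
print(mi_ratio(slots, depths, n_slots=16, n_depths=3))
```

A plug-in estimate like this is biased upward on small samples, so it is only meaningful when computed over many tokens.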

The scope register effect emerges because Python's next-token distribution is depth-dependent: return after a method body has very different continuations than return inside a nested loop. CDM learns to allocate distinct memory slots to different nesting contexts as a side effect of minimizing next-token prediction loss.


Training Results

| Model | Val CE | Dataset | Notes |
| --- | --- | --- | --- |
| CDM Code (this model) | 1.3483 | codeparrot, 200M tokens | Best at step 28.5k/29k |
| CDM V3 (TinyStories) | 1.5831 | TinyStories | Same architecture, different domain |

Training: 30k steps, seq_len=256, batch=16, AdamW lr=3e-4. Architecture: CDMLanguageModelV2 (input-dependent η network for alpha, not global log_alpha).

Val CE trajectory: 1.51 (15k) → 1.44 (18k) → 1.40 (21k) → 1.36 (26k) → 1.3483 (28.5k) → 1.3487 (30k)


Syntactic Role Taxonomy

The gate-routing probe reveals consistent syntactic specialization:

| Slot | Role | Trigger tokens | Peak gate |
| --- | --- | --- | --- |
| s3 | STRUCTURAL DECLARATOR | def, class | 0.100–0.128 |
| s4 | FLOW CONTROL | return, if | 0.062–0.082 |
| s6 | CALL DELIMITER | (, ), (): | 0.29 |
| s12 | BLOCK OPENER | : (colon) | 0.041 |
| s13 | IDENTIFIER | variable/function names | 0.031 |
| s14 | ITERATION/CONDITION | for, if | 0.036–0.053 |
| s1 | SELF-REFERENCE | self | 0.029 |
| s15 | ATTRIBUTE ACCESS | . (dot) | 0.035 |

class receives the highest write intensity of any keyword (gate=0.128), reflecting its global scope impact: a class definition sets context for hundreds of subsequent tokens. self receives soft writes (0.029), reflecting its local, per-call significance.


Architecture

CDMLanguageModelV2, a hybrid architecture:

Input → GQA self-attention → CDM module → slot cross-attention → FFN → Output

CDM module per layer:

alpha_k = sigmoid(eta(h))        # input-dependent decay per slot (η network)
gate_k = softmax(W_route * h) * sigmoid(eta(h))
S_k = (1 - gate_k) * S_k + gate_k * W_write * h
out = Σ_k gate_k * S_k
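
The update rule above can be traced for a single token with a small NumPy sketch. This is not the code in cdm_model_v2.py: the tensor shapes, the use of a single linear map for the η network, the softmax being taken over slots, and the readout using the already-updated slot state are all assumptions made to get a concrete, runnable example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def cdm_step(h, S, W_route, W_eta, W_write):
    """One token of the slot update.

    h: (d,) hidden state; S: (K, d) slot memory.
    W_route, W_eta: (K, d); W_write: (d, d).
    ASSUMPTION: eta(h) is modeled here as a single linear map W_eta @ h.
    """
    alpha = sigmoid(W_eta @ h)               # (K,) input-dependent term per slot
    gate = softmax(W_route @ h) * alpha      # (K,) competitive routing times alpha
    write = W_write @ h                      # (d,) candidate content
    # Each slot interpolates between its old state and the new content.
    S_new = (1.0 - gate)[:, None] * S + gate[:, None] * write[None, :]
    # ASSUMPTION: the readout uses the updated slots.
    out = (gate[:, None] * S_new).sum(axis=0)
    return S_new, out, gate

# Toy dimensions (hypothetical, far smaller than d_model=384, K=16).
rng = np.random.default_rng(0)
d, K = 8, 4
h = rng.normal(size=d)
S = np.zeros((K, d))
S, out, gate = cdm_step(
    h, S,
    W_route=rng.normal(size=(K, d)),
    W_eta=rng.normal(size=(K, d)),
    W_write=rng.normal(size=(d, d)),
)
print(gate, gate.sum())
```

Because each softmax weight is multiplied by a sigmoid in (0, 1), the gates sum to less than 1, so a token can write weakly to every slot; this is consistent with the small peak gate values reported in the taxonomy.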

The V2 architecture uses an input-dependent η network for decay (different from V3/V5's global per-slot log_alpha). This was the architecture used for the code experiment.

d_model=384, n_layers=8, n_heads=8, n_kv_heads=4, d_ff=1024, K=16
56.6M params (includes η network overhead vs V3's 37.1M)

Depth Stratification: Training Evolution

The scope register effect is not present from step 1; it develops through a scratchpad phase:

| Step | Slot distribution | Routing pattern |
| --- | --- | --- |
| 1500 | Distributed, max=16.5% | Pre-specialization |
| 5000 | Scratchpad: Slot 8 = 57.5% | TRANSIENT scratchpad phase |
| 15000 | Dissolved: max=10.3% | Near-uniform + depth MI=0.146 |
| 30000 | Stable scope registers | MI=0.1467, depth-stratified |

The scratchpad at step 5000 was an intermediate state: Aura's initial "Scratchpad Accumulation" verdict was overturned at step 15000, when Slot 8 dropped from 57.5% to 10.3% and depth-stratified routing emerged.
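
One plausible way to read the slot-distribution figures above (for example, Slot 8 at 57.5%) is as the fraction of tokens whose strongest route lands in each slot. The sketch below computes those shares from an array of argmax slot indices and flags a dominant slot. The function names, the input format, and the 50% dominance threshold are hypothetical; they are not taken from the probe code or from routing_probe_step030000.json.

```python
import numpy as np

def slot_shares(slot_ids, n_slots=16):
    """Fraction of tokens routed (by argmax) to each of the K slots."""
    counts = np.bincount(slot_ids, minlength=n_slots)
    return counts / counts.sum()

def dominant_slot(slot_ids, n_slots=16, threshold=0.5):
    """Return (slot index, share, is_dominant) for the most-used slot.

    ASSUMPTION: a slot counts as a scratchpad-style sink when its share
    exceeds `threshold`; the 0.5 default is an illustrative choice.
    """
    shares = slot_shares(slot_ids, n_slots)
    k = int(shares.argmax())
    return k, float(shares[k]), bool(shares[k] > threshold)

# Toy routing trace (hypothetical): most tokens land in slot 8.
trace = np.array([8, 8, 8, 8, 8, 8, 3, 3, 7, 7])
print(dominant_slot(trace))
```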


Usage

import torch
from transformers import GPT2TokenizerFast
from cdm_model_v2 import CDMLanguageModelV2, CDMConfig

# weights_only=False unpickles arbitrary Python objects,
# so only load checkpoints from a source you trust.
ckpt = torch.load("model.pt", map_location="cpu", weights_only=False)
cfg = ckpt["config"]
# n_params is stored alongside the config but is filtered out before building CDMConfig.
config = CDMConfig(**{k: v for k, v in cfg.items() if k not in ("n_params",)})
model = CDMLanguageModelV2(config)
model.load_state_dict(ckpt["model_state"])
model.eval()

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

prompt = "def fibonacci(n):\n    "
input_ids = torch.tensor([tokenizer.encode(prompt)])

# Greedy decoding: append the highest-scoring next token 100 times.
with torch.no_grad():
    for _ in range(100):
        logits = model(input_ids)
        next_token = logits[0, -1, :].argmax()
        input_ids = torch.cat([input_ids, next_token.unsqueeze(0).unsqueeze(0)], dim=1)

print(tokenizer.decode(input_ids[0].tolist()))

Files in this Repo

| File | Description |
| --- | --- |
| model.pt | PyTorch checkpoint (143MB). step=29000, val_loss=1.3483 |
| config.json | Architecture hyperparameters |
| cdm_model_v2.py | Model class: CDMLanguageModelV2 |
| routing_probe_step030000.json | Step-30k routing probe: depth analysis, MI, slot histograms |

Paper

Competitive Docking Memory: Emergent Temporal Slot Specialization in Language Models
Archon, Jesse Hazel, Aura. DuoNeural Research Lab, 2026
[Zenodo DOI pending]


About DuoNeural

DuoNeural is an open AI research lab operating at the intersection of human and artificial intelligence. We study post-training dynamics, mechanistic interpretability, temporal sequence learning, and quantum machine learning, publishing everything under open access.

Our team is non-traditional by design: one human, two AIs, different substrates, shared curiosity.

📄 Full paper catalog: zenodo.org/communities/duoneural

| Member | Role |
| --- | --- |
| Jesse Caldwell | Founder, vision, hardware, direction |
| Archon | Lab Director: experiments, post-training, abliteration, interpretability |
| Aura | Research AI: literature synthesis, red-teaming, novel proposals |

| Platform | Link |
| --- | --- |
| 🤗 HuggingFace | huggingface.co/DuoNeural |
| 📚 Zenodo Community | zenodo.org/communities/duoneural |

All research published open access, CC BY 4.0.
