CART: Context-Anchored Recurrent Transformer -- A Parameter-Efficient Architecture with Learned Stability
Abstract
CART is a parameter-efficient language model that uses a shared core block with frozen key-value tensors and a stable LTI gate for recurrent processing, but performs worse than dense baselines at parameter parity.
We present CART (Context-Anchored Recurrent Transformer), a parameter-efficient language model that reuses a single shared core block R times across depth. Unlike prior looped transformers that recompute key-value tensors at every iteration, CART computes K and V once from a multi-layer prelude and has the recurrent core cross-attend to those frozen tensors via multi-head latent attention. A learned Linear Time-Invariant (LTI) gate keeps the recurrence stable: its spectral radius settles in a narrow band (rho in [0.79, 0.83]) across all 36 fully-trained configurations. We evaluate CART on single consumer GPUs in two stages: a 64-configuration screen at 3,000 steps, then 36 configurations (P=6, R in {6,8,10}, three seeds) trained for 30,500 steps (~1B tokens). Two patterns hold across widths d in {256,512,768,1024}: prelude depth P dominates loop count R, and the Stage-1 ranking of R reverses at full training (R=6 becomes best at d>=512). At the binding d=1024 parameter-parity test, CART does not beat a parameter-matched dense baseline, losing by 1-2% at stored-parameter parity and by ~10% at effective-parameter parity. Diagnostic ablations split the effective-parameter gap into ~5% from weight sharing and a residual ~5% from the heterogeneous prelude/anchor/core/coda framing; the recurrent-core machinery (hyper-connections, LTI gate, loop-index embedding) is individually vestigial. Variable-R inference degrades on both sides of the trained R, a negative result for test-time depth scaling under this recipe.
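The control flow described above (prelude once, frozen K and V, one shared core applied R times under a gate) can be illustrated with a toy sketch. This is not the authors' implementation. The names `cart_forward`, `causal_cross_attend`, `prelude`, `loops`, and `decay` are hypothetical. The sketch assumes single-head attention with identity projections in place of multi-head latent attention, and a scalar gate of the form h <- a*h + (1-a)*u, whose spectral radius is simply |a|. The abstract does not specify the gate's actual form.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_cross_attend(h, K, V):
    """Single-head attention: position t of h reads the frozen K[:t+1], V[:t+1]."""
    d = len(K[0])
    out = []
    for t, q in enumerate(h):
        scores = [
            sum(qi * ki for qi, ki in zip(q, K[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        w = softmax(scores)
        out.append([sum(w[s] * V[s][i] for s in range(t + 1)) for i in range(d)])
    return out


def cart_forward(x, prelude, loops, decay):
    """Toy forward pass: run the prelude once, then reuse one core `loops` times."""
    anchor = prelude(x)
    # Assumption: identity projections. The point is that K and V are built
    # once here and never recomputed inside the loop.
    K, V = anchor, anchor
    h = [row[:] for row in anchor]
    for _ in range(loops):
        u = causal_cross_attend(h, K, V)  # same core, same frozen K/V each pass
        # Assumed scalar LTI gate; its spectral radius is |decay|.
        h = [
            [decay * hi + (1.0 - decay) * ui for hi, ui in zip(h_row, u_row)]
            for h_row, u_row in zip(h, u)
        ]
    return h


# Usage: three token vectors, an identity prelude, R = 6, and a gate value
# chosen inside the reported band [0.79, 0.83].
x = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
h = cart_forward(x, prelude=lambda t: t, loops=6, decay=0.8)
```

Because only the query side changes between iterations in this sketch, the key-value tensors add no per-loop cost. A real model would add learned projections, multiple heads, and the coda that the abstract mentions.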