arxiv:2603.15031

Attention Residuals

Published on Mar 16

Submitted by taesiri on Mar 17

Moonshot AI

Authors:

Yu Zhang ,

Abstract

Residual connections with PreNorm are standard in modern LLMs, yet they accumulate all layer outputs with fixed unit weights. This uniform aggregation causes uncontrolled hidden-state growth with depth, progressively diluting each layer's contribution. We propose Attention Residuals (AttnRes), which replaces this fixed accumulation with softmax attention over preceding layer outputs, allowing each layer to selectively aggregate earlier representations with learned, input-dependent weights. To address the memory and communication overhead of attending over all preceding layer outputs for large-scale model training, we introduce Block AttnRes, which partitions layers into blocks and attends over block-level representations, reducing the memory footprint while preserving most of the gains of full AttnRes. Combined with cache-based pipeline communication and a two-phase computation strategy, Block AttnRes becomes a practical drop-in replacement for standard residual connections with minimal overhead. Scaling law experiments confirm that the improvement is consistent across model sizes, and ablations validate the benefit of content-dependent depth-wise selection. We further integrate AttnRes into the Kimi Linear architecture (48B total / 3B activated parameters) and pre-train on 1.4T tokens, where AttnRes mitigates PreNorm dilution, yielding more uniform output magnitudes and gradient distribution across depth, and improves downstream performance across all evaluated tasks.
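
The contrast the abstract draws, between fixed unit-weight accumulation and softmax attention over preceding layer outputs, can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' implementation. The abstract does not say how queries and keys are formed, so the sketch assumes a single learned query vector per layer (`query`) scored against RMS-normalized earlier outputs, and it assumes a block-level representation is the plain sum of the layer outputs inside that block. The helper names `rms_norm`, `uniform_residual`, `attn_residual`, and `block_summaries` are hypothetical.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Scale each token vector to roughly unit root-mean-square.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def uniform_residual(outputs):
    # Standard residual stream: every earlier output enters with weight 1.
    return np.sum(np.stack(outputs, axis=0), axis=0)

def attn_residual(outputs, query):
    # outputs: list of n arrays with shape (tokens, d); query: shape (d,).
    # ASSUMPTION: one learned query vector per layer, keys are RMS-normalized
    # earlier outputs, values are the raw outputs.
    stacked = np.stack(outputs, axis=0)        # (n, tokens, d)
    scores = rms_norm(stacked) @ query         # (n, tokens)
    weights = softmax(scores, axis=0)          # normalized over depth, per token
    mixed = np.sum(weights[..., None] * stacked, axis=0)  # (tokens, d)
    return mixed, weights

def block_summaries(outputs, block_size):
    # ASSUMPTION: a block-level representation is the sum of its layers' outputs.
    return [
        np.sum(np.stack(outputs[i:i + block_size], axis=0), axis=0)
        for i in range(0, len(outputs), block_size)
    ]

# Toy data: 8 "layer outputs" for 4 tokens with hidden size 16.
rng = np.random.default_rng(0)
tokens, d, n_layers = 4, 16, 8
outputs = [rng.standard_normal((tokens, d)) for _ in range(n_layers)]
query = rng.standard_normal(d)

mixed, weights = attn_residual(outputs, query)        # attends over 8 entries
blocks = block_summaries(outputs, block_size=4)       # 2 block representations
block_mixed, block_weights = attn_residual(blocks, query)  # attends over 2 entries
```

Because the softmax weights are non-negative and sum to one over depth, the mixed state is a convex combination of earlier outputs, so its norm cannot exceed that of the largest input, whereas the unit-weight sum has no such bound. In the block variant the attention runs over one entry per block rather than one per layer, which is where the memory saving described in the abstract would come from under these assumptions.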

Community

search-facility

about 18 hours ago

Interesting idea 👍

avahal

about 15 hours ago

block-attn residuals are the most interesting part here; replacing uniform depth-wise sums with learned, block-level attention feels like the right granularity to keep information flow stable without blasting memory. my main question is how block size and the number of blocks n trade off: did you run ablations across n ∈ {2,4,8} for a fixed depth, and is there a sweet spot where gains persist with modest memory overhead? the arxivLens breakdown helped me parse the details, especially the two-phase compute and the cache-based communication, which otherwise felt easy to underestimate. i’d be curious how the approach behaves on models with highly imbalanced layer budgets or in settings where some layers are more compute-heavy, to see if the learned attention still stabilizes training.

yzhangcs

Paper author about 10 hours ago

@avahal We chose 8 as the default value primarily to align with mHC standards while maximizing the number of blocks. Although increasing the block count generally improves performance, we found that for large-scale LLMs, this number is strictly constrained by communication overhead and memory pressure. Ultimately, 8 represents the optimal balance between these factors.

You can see the impact of various block sizes on the loss in the figure below.