March 30, 2026

EP137: Attention Residuals Solve the LLM Depth Bottleneck

22 minutes

The paper "Attention Residuals (AttnRes)" by the Kimi Team (MoonshotAI) proposes a novel replacement for the standard residual connections used in modern Large Language Models (LLMs).

Standard residual connections sum all previous layer outputs with fixed unit weights. As the model gets deeper, this leads to "uncontrolled hidden-state growth" and a "dilution" of each layer’s relative contribution. To address this, the researchers introduce Attention Residuals, which replace fixed additive accumulation with learned softmax attention over all preceding layer outputs. Each layer can then selectively aggregate earlier representations using learned, input-dependent weights.
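The contrast between the two aggregation rules can be sketched in a few lines of plain Python. This is an illustration only, not the paper's implementation: the dot-product scoring and the per-layer `query` vector are assumptions made for the example, and the episode notes do not specify how the attention scores are computed.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def standard_residual(outputs):
    """Fixed unit weights: add up every earlier layer output elementwise."""
    return [sum(vals) for vals in zip(*outputs)]

def attn_residual(outputs, query):
    """Softmax-weighted mix of earlier layer outputs.

    `query` is a hypothetical learned vector for the current layer; scoring it
    against each output by dot product is an assumption of this sketch.
    """
    scores = [sum(q * x for q, x in zip(query, out)) for out in outputs]
    weights = softmax(scores)
    mixed = [
        sum(w * out[i] for w, out in zip(weights, outputs))
        for i in range(len(query))
    ]
    return mixed, weights

# Three earlier layer outputs with hidden size 2 (toy values).
outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
summed = standard_residual(outputs)                   # grows with every added layer
mixed, weights = attn_residual(outputs, [0.0, 0.0])   # zero query gives uniform weights
```

Because the softmax weights are non-negative and sum to 1, the mixed state is a convex combination of the earlier outputs, so its magnitude cannot exceed that of the largest one. The plain sum has no such bound, which is the growth problem described above.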

Because attending over every previous layer (Full AttnRes) incurs significant memory and communication overhead during large-scale training, $O(Ld)$ for $L$ layers and hidden size $d$, the authors developed Block AttnRes. This variant:

Partitions layers into blocks (typically around 8 blocks).
Attends over block-level representations, reducing memory and communication costs to $O(Nd)$, where $N$ is the number of blocks.
Functions as a practical drop-in replacement with minimal overhead: less than 4% for training and under 2% for inference latency.
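The block variant can be sketched the same way. This is a rough illustration under stated assumptions: blocks are taken to be contiguous, each block is summarized by the plain sum of its layer outputs, and the per-layer `query` vector and dot-product scoring are hypothetical. The notes above do not say how the block-level representations are formed.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def block_summaries(outputs, num_blocks):
    """Split layer outputs into contiguous blocks and sum within each block.

    Summation inside a block is an assumption of this sketch. Ceiling division
    yields at most `num_blocks` blocks.
    """
    size = math.ceil(len(outputs) / num_blocks)
    blocks = [outputs[i:i + size] for i in range(0, len(outputs), size)]
    return [[sum(vals) for vals in zip(*block)] for block in blocks]

def block_attn_residual(outputs, query, num_blocks):
    """Attend over block summaries instead of every individual layer output."""
    summaries = block_summaries(outputs, num_blocks)
    scores = [sum(q * x for q, x in zip(query, s)) for s in summaries]
    weights = softmax(scores)
    mixed = [
        sum(w * s[i] for w, s in zip(weights, summaries))
        for i in range(len(query))
    ]
    return mixed, len(summaries)

# 16 layer outputs with hidden size 2, grouped into 8 blocks (toy values).
outputs = [[1.0, float(i)] for i in range(16)]
mixed, stored = block_attn_residual(outputs, [0.0, 0.0], num_blocks=8)
# `stored` is 8: the layer attends over N block vectors rather than L = 16 layer vectors.
```

The number of vectors that must be kept and communicated drops from the layer count to the block count, which is where the reduction from $O(Ld)$ to $O(Nd)$ comes from.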

Mitigates Dilution: AttnRes effectively manages hidden-state magnitudes and ensures a more uniform gradient distribution across the depth of the model.
Consistent Scaling: Scaling law experiments demonstrate that AttnRes consistently outperforms standard PreNorm baselines across various model sizes; Block AttnRes matched the loss of a baseline that used 1.25x more compute.
Performance Gains: When integrated into a 48B-parameter model (3B activated) and trained on 1.4T tokens, AttnRes improved performance across all evaluated downstream tasks, with particularly significant gains in multi-step reasoning, math, and coding.
Architecture Shifts: The study suggests that AttnRes allows models to exploit additional depth more effectively than conventional Transformer designs.


By Yun Wu

