


The paper "Attention Residuals (AttnRes)" by the Kimi Team (MoonshotAI) proposes a novel replacement for the standard residual connections used in modern Large Language Models (LLMs).
Standard residual connections use fixed unit weights to sum all previous layer outputs, which leads to "uncontrolled hidden-state growth" and a "dilution" of each layer’s relative contribution as the model gets deeper. To solve this, the researchers introduce Attention Residuals, which replaces fixed additive accumulation with learned softmax attention over all preceding layer outputs. This allows each layer to selectively aggregate earlier representations using learned, input-dependent weights.
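The contrast between the two aggregation schemes can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the single `query` vector standing in for the learned parameters, and the `1/sqrt(d)` score scaling are all hypothetical choices made for this example.

```python
import math

def standard_residual(layer_outputs):
    """Fixed unit weights: the state is the plain sum of all earlier outputs."""
    return [sum(column) for column in zip(*layer_outputs)]

def attention_residual(layer_outputs, query):
    """Softmax-weighted mix of earlier layer outputs.

    layer_outputs: list of d-dimensional vectors, one per preceding layer.
    query: hypothetical learned d-dimensional vector for the current layer.
    The 1/sqrt(d) scaling is an assumption borrowed from standard attention.
    """
    d = len(query)
    scores = [
        sum(h_j * q_j for h_j, q_j in zip(h, query)) / math.sqrt(d)
        for h in layer_outputs
    ]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    mixed = [
        sum(w * h[j] for w, h in zip(weights, layer_outputs))
        for j in range(d)
    ]
    return mixed, weights

# Three toy "layer outputs" for one token, with d = 2.
outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]

summed = standard_residual(outputs)
mixed, weights = attention_residual(outputs, query)
```

Because the weights are a softmax, they sum to one, so the mixed state is a convex combination of earlier outputs rather than a sum that keeps growing with depth. The weights also depend on the layer outputs themselves, so they change with the input.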
Because attending over every single previous layer (Full AttnRes) creates significant memory and communication overhead ($O(Ld)$) during large-scale training, the authors developed Block AttnRes. This variant:
By Yun Wu