

March 17, 2026

Is Residual Scaling Obsolete? Introducing Attention Residuals

9 minutes

Standard residual connections have been the "gradient highway" for every major LLM, but they have a hidden flaw: they treat every layer as equally important. In this video, we break down Attention Residuals (AttnRes), a new architecture from the Kimi Team that replaces fixed additive residuals with learned, input-dependent softmax attention over the depth of the model.

By treating the "depth" of a model like the "sequence" of a Transformer, AttnRes addresses the "PreNorm dilution" problem, in which early-layer information gets buried as models get deeper. The result? A 1.25x compute advantage and massive gains in complex reasoning and coding tasks.

For a technical deep dive into the scaling laws, Block AttnRes optimizations, and the "Sequence-Depth Duality," check out our full podcast episode:

The Sequence-Depth Breakthrough: Inside Kimi Team's Attention Residuals
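
To make the core idea concrete, here is a minimal sketch in plain Python of the contrast described above: a fixed additive residual gives every earlier layer output the same weight, while a softmax over depth assigns input-dependent weights. This is an illustration under stated assumptions, not the Kimi Team's implementation: the function names, the single `query` vector standing in for learned parameters, and the dot-product scoring are all hypothetical choices made for this example.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def standard_residual(layer_outputs):
    # Fixed additive residual: every earlier output contributes with weight 1.
    dim = len(layer_outputs[0])
    return [sum(h[i] for h in layer_outputs) for i in range(dim)]

def attention_residual(layer_outputs, query):
    # Hypothetical depth attention: score each earlier layer output against
    # a query vector (assumed to be learned), then mix with softmax weights.
    scale = math.sqrt(len(query))
    scores = [dot(query, h) / scale for h in layer_outputs]
    weights = softmax(scores)
    dim = len(layer_outputs[0])
    mixed = [
        sum(w * h[i] for w, h in zip(weights, layer_outputs))
        for i in range(dim)
    ]
    return mixed, weights

# Toy outputs from three earlier layers, each a 2-dimensional vector.
outputs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [2.0, 0.0]  # illustrative values only

mixed, weights = attention_residual(outputs, query)
print("additive residual:", standard_residual(outputs))
print("depth-attention weights:", weights)
print("depth-attention mix:", mixed)
```

In this toy setup the weights sum to 1 and change whenever the layer outputs change, so a layer whose output aligns with the query contributes more than the others, whereas the additive version weights all three outputs identically regardless of content.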

Stay ahead of the curve: