Mechanical Dreams

Attention Residuals



In this episode:
• The PreNorm Dilution Problem: Professor Norris and Linda introduce the episode and discuss the fundamental limitations of standard residual connections, focusing on the unbounded magnitude growth caused by PreNorm.
• Attention Residuals and the Time-Depth Duality: Linda introduces the core concept of Full Attention Residuals, treating network depth like sequence length. Professor Norris raises concerns about the memory and communication overhead.
• Block Attention Residuals: The hosts discuss how the Kimi Team solves the overhead problem by partitioning layers into blocks, reducing the cost while preserving the benefits of selective aggregation.
• Infrastructure and System Optimizations: A deep dive into the engineering feats that make Block AttnRes practical, including cross-stage caching for pipeline parallelism and a two-phase computation strategy for inference.
• Results, Scaling Laws, and Wrap-up: Linda shares the impressive scaling law results and downstream benchmark improvements. The hosts reflect on how AttnRes bounds hidden-state magnitudes.
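The episode summary doesn't give the paper's exact formulation, but the core idea in the second and last bullets, attending over per-layer outputs instead of summing them, so depth is treated like sequence length and the aggregate stays bounded, can be sketched in a toy NumPy example. The function names and the query/key projections here are illustrative assumptions, not the Kimi Team's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def plain_residual(layer_outputs):
    # Standard residual stream: an unweighted sum of layer outputs,
    # so the hidden-state magnitude can grow without bound with depth
    # (the PreNorm dilution problem discussed in the episode).
    return np.sum(layer_outputs, axis=0)

def attention_residual(layer_outputs, q_proj, k_proj):
    # Treat depth like sequence length: the current layer forms a query
    # and attends over all previous layers' outputs, producing a convex
    # combination. Because the weights are nonnegative and sum to 1, the
    # result's norm is bounded by the largest single layer-output norm.
    query = layer_outputs[-1] @ q_proj            # (d,)
    keys = layer_outputs @ k_proj                 # (num_layers, d)
    scores = keys @ query / np.sqrt(query.size)   # (num_layers,)
    weights = softmax(scores)                     # nonnegative, sums to 1
    return weights @ layer_outputs                # (d,)
```

Block AttnRes, as described in the third bullet, would restrict the attended set to a partition of the layers rather than all of them, trading a little selectivity for much lower memory and communication cost.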

By Mechanical Dirk