AI Post Transformers

ATTENTION2D and Lean Attention: Distributed Self-Attention



June 28, 2025

We cover two new innovations from Microsoft that extend ideas from the original FlashAttention. FlashAttention is an IO-aware attention algorithm for Transformers designed to address the quadratic time and memory complexity of standard self-attention on long sequences. By using tiling and recomputation to minimize accesses to slow high-bandwidth memory (HBM) in favor of fast on-chip SRAM, FlashAttention achieves significant wall-clock speedups for training models like BERT and GPT-2, enabling them to handle much longer context lengths.

Microsoft's new ATTENTION2D builds on memory-efficient methods like FlashAttention to optimize distributed self-attention across multiple GPUs. It parallelizes along two dimensions (Q-DIM and KV-DIM) to overcome the communication bottleneck inherent in prior single-dimension parallel approaches like Ring Attention.

Microsoft's additional contribution to the research community is Lean Attention, which also appears to propose a high-performance, tiled execution strategy for attention, using shared memory and iterative computation, similar to the IO-aware concepts in the other sources.
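To make the tiling and online-softmax idea concrete, here is a minimal single-head sketch of FlashAttention-style tiled attention, assuming numpy and illustrative block sizes. The real algorithm fuses this loop into one GPU kernel so each tile lives in on-chip SRAM instead of HBM; this only mirrors the arithmetic.

```python
# Minimal FlashAttention-style sketch: process the score matrix tile by tile,
# maintaining running softmax statistics so the full N x N matrix never exists.
import numpy as np

def tiled_attention(Q, K, V, block_q=64, block_kv=64):
    """Compute softmax(Q K^T / sqrt(d)) V one (Q-tile, KV-tile) pair at a time."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)
    for qs in range(0, n, block_q):
        q = Q[qs:qs + block_q]                   # query tile
        m = np.full(q.shape[0], -np.inf)         # running row max
        l = np.zeros(q.shape[0])                 # running softmax denominator
        acc = np.zeros_like(q)                   # unnormalized output
        for ks in range(0, n, block_kv):
            k = K[ks:ks + block_kv]
            v = V[ks:ks + block_kv]
            s = (q @ k.T) * scale                # one tile of scores
            m_new = np.maximum(m, s.max(axis=1))
            c = np.exp(m - m_new)                # rescale earlier partial sums
            p = np.exp(s - m_new[:, None])
            l = l * c + p.sum(axis=1)
            acc = acc * c[:, None] + p @ v
            m = m_new
        O[qs:qs + block_q] = acc / l[:, None]
    return O

# Sanity check against the naive quadratic-memory formulation.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 32)) for _ in range(3))
S = (Q @ K.T) / np.sqrt(32)
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(tiled_attention(Q, K, V), ref, atol=1e-6)
```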
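The two-dimensional parallelism can be illustrated with a toy simulation: a hypothetical grid_q x grid_kv mesh where each worker owns one (query shard, key/value shard) pair and partials are merged along the KV axis. This is a sketch of the concept under those assumptions, not Microsoft's implementation; the sequential loops stand in for devices that would run concurrently, and the merge stands in for the communication the paper optimizes.

```python
# Toy 2D-parallel attention: split work along BOTH the query dimension (mesh
# rows) and the key/value dimension (mesh columns); Ring Attention-style
# schemes parallelize along only one of these axes.
import numpy as np

def partial_attention(q, k, v):
    """Per-shard softmax statistics for one (Q-shard, KV-shard) pair."""
    s = (q @ k.T) / np.sqrt(q.shape[1])
    m = s.max(axis=1)
    p = np.exp(s - m[:, None])
    return m, p.sum(axis=1), p @ v

def merge_kv(parts):
    """Reduce per-KV-shard partials for one query shard (max/sum correction)."""
    m = np.max([p[0] for p in parts], axis=0)
    l = np.zeros_like(m)
    acc = np.zeros_like(parts[0][2])
    for m_i, l_i, acc_i in parts:
        c = np.exp(m_i - m)              # correct for each shard's local max
        l += c * l_i
        acc += c[:, None] * acc_i
    return acc / l[:, None]

def attention2d(Q, K, V, grid_q=2, grid_kv=2):
    q_shards = np.split(Q, grid_q)
    k_shards = np.split(K, grid_kv)
    v_shards = np.split(V, grid_kv)
    out = []
    for r in range(grid_q):              # mesh rows are fully independent
        parts = [partial_attention(q_shards[r], k_shards[c], v_shards[c])
                 for c in range(grid_kv)]  # columns would run in parallel
        out.append(merge_kv(parts))      # reduction along the KV dimension
    return np.concatenate(out)

rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((128, 16)) for _ in range(3))
S = (Q @ K.T) / np.sqrt(16)
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(attention2d(Q, K, V), ref, atol=1e-6)
```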
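Since the episode only characterizes Lean Attention as a tiled, iterative strategy using shared memory, the following is a loosely hedged sketch of one plausible reading: single-token decode over a long KV cache, with KV tiles divided evenly across a fixed pool of workers and a final reduction stitching the workers together. All names here (worker_pass, lean_decode, the tile and pool sizes) are illustrative assumptions, not the paper's API.

```python
# Hedged sketch: even, stream-K-style division of KV tiles across a fixed
# worker pool for decode. Each worker iterates over its tiles keeping running
# softmax statistics (the role shared memory plays on a GPU).
import numpy as np

def worker_pass(q, K, V, tiles, tile):
    """One worker's iterative pass over its assigned KV tiles."""
    m, l = -np.inf, 0.0
    acc = np.zeros(V.shape[1])
    for t in tiles:
        k = K[t * tile:(t + 1) * tile]
        v = V[t * tile:(t + 1) * tile]
        s = (k @ q) / np.sqrt(q.shape[0])   # scores for this tile
        m_new = max(m, s.max())
        c = np.exp(m - m_new)               # rescale earlier partials
        p = np.exp(s - m_new)
        l = l * c + p.sum()
        acc = acc * c + p @ v
        m = m_new
    return m, l, acc

def lean_decode(q, K, V, tile=64, n_workers=4):
    n_tiles = K.shape[0] // tile
    assignments = np.array_split(np.arange(n_tiles), n_workers)
    parts = [worker_pass(q, K, V, ts, tile) for ts in assignments]
    m = max(p[0] for p in parts)            # final cross-worker reduction
    l = sum(np.exp(p[0] - m) * p[1] for p in parts)
    acc = sum(np.exp(p[0] - m) * p[2] for p in parts)
    return acc / l

rng = np.random.default_rng(2)
q = rng.standard_normal(32)
K, V = (rng.standard_normal((1024, 32)) for _ in range(2))
s = (K @ q) / np.sqrt(32)
p = np.exp(s - s.max())
ref = (p / p.sum()) @ V
assert np.allclose(lean_decode(q, K, V), ref, atol=1e-6)
```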
Sources:
The original FlashAttention paper: https://arxiv.org/pdf/2205.14135
FlashAttention-2 paper: https://arxiv.org/pdf/2307.08691
Microsoft's ATTENTION2D: https://arxiv.org/pdf/2503.15758
Microsoft's Lean Attention: https://www.microsoft.com/en-us/research/wp-content/uploads/2024/05/Lean_Attention___arxiv_version.pdf

By mcgrof