We cover two recent contributions from Microsoft that extend ideas from the original **FlashAttention**. FlashAttention is an IO-aware attention algorithm for Transformers designed to address the quadratic time and memory complexity of standard self-attention on long sequences. By using **tiling and recomputation** to minimize slow **High Bandwidth Memory (HBM)** accesses in favor of fast **on-chip SRAM**, FlashAttention achieves significant wall-clock speedups for training models like BERT and GPT-2 and enables much longer context lengths.

Microsoft's **ATTENTION2D** builds on memory-efficient methods like FlashAttention to optimize **distributed self-attention** across multiple GPUs: it parallelizes the computation along two dimensions (the query dimension, Q-DIM, and the key/value dimension, KV-DIM) to overcome the communication bottleneck of prior single-dimension parallel approaches such as Ring Attention.

Microsoft's second contribution, **Lean Attention**, likewise proposes a high-performance, tiled execution strategy for attention that uses shared memory and iterative computation of partial results, in the spirit of the IO-aware design behind FlashAttention.
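To make the tiling-and-recomputation idea concrete, here is a minimal NumPy sketch of the online-softmax accumulation that FlashAttention performs per tile: scores are computed one KV block at a time, and a running max and denominator let the output be rescaled on the fly so the full score matrix is never materialized. This is an illustration of the schedule only, not the CUDA kernel from the paper; the block sizes and function name are my own.

```python
import numpy as np

def tiled_attention(Q, K, V, block_q=64, block_kv=64):
    """Single-head attention computed block by block with an online softmax,
    so no full (seq_q x seq_kv) score matrix is ever materialized."""
    seq_q, d = Q.shape
    seq_kv = K.shape[0]
    scale = 1.0 / np.sqrt(d)

    out = np.zeros((seq_q, d))
    for qs in range(0, seq_q, block_q):
        qe = min(qs + block_q, seq_q)
        q = Q[qs:qe]                                 # query tile (stays "on chip")
        m = np.full(qe - qs, -np.inf)                # running row max
        l = np.zeros(qe - qs)                        # running softmax denominator
        acc = np.zeros((qe - qs, d))                 # unnormalized output accumulator

        for ks in range(0, seq_kv, block_kv):
            ke = min(ks + block_kv, seq_kv)
            s = (q @ K[ks:ke].T) * scale             # scores for this KV tile only
            m_new = np.maximum(m, s.max(axis=1))
            p = np.exp(s - m_new[:, None])           # tile-local softmax numerator
            correction = np.exp(m - m_new)           # rescale previous partial results
            l = l * correction + p.sum(axis=1)
            acc = acc * correction[:, None] + p @ V[ks:ke]
            m = m_new

        out[qs:qe] = acc / l[:, None]
    return out

# Sanity check against the naive quadratic-memory implementation.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 64)) for _ in range(3))
scores = (Q @ K.T) / np.sqrt(64)
p = np.exp(scores - scores.max(axis=1, keepdims=True))
ref = (p / p.sum(axis=1, keepdims=True)) @ V
assert np.allclose(tiled_attention(Q, K, V), ref, atol=1e-6)
```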
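The distributed and decode-time variants rest on the same observation: partial attention results computed over disjoint KV blocks can be merged exactly as long as each block also returns its softmax max and denominator. The sketch below simulates a 2 x 2 grid in a single process, splitting work along both the query and key/value dimensions and then reducing the partials. It only illustrates the partition-and-merge structure the sources describe; the block counts, function names, and single-process simulation are my own, and it does not reproduce ATTENTION2D's actual multi-GPU communication schedule. A similar reduction of tile-local partials appears to be what Lean Attention's iterative scheme relies on as well.

```python
import numpy as np

def partial_attention(q, k, v, scale):
    """One worker's contribution for a (Q-block, KV-block) tile:
    unnormalized output plus the softmax statistics needed to merge later."""
    s = (q @ k.T) * scale
    m = s.max(axis=1)                  # row max of this tile
    p = np.exp(s - m[:, None])
    return p @ v, p.sum(axis=1), m     # (accumulator, denominator, max)

def merge_partials(partials):
    """Combine partial results produced independently along the KV dimension."""
    acc, l, m = partials[0]
    for acc_i, l_i, m_i in partials[1:]:
        m_new = np.maximum(m, m_i)
        a, b = np.exp(m - m_new), np.exp(m_i - m_new)
        acc = acc * a[:, None] + acc_i * b[:, None]
        l = l * a + l_i * b
        m = m_new
    return acc / l[:, None]

# Simulate a 2 x 2 grid: queries split along one axis, keys/values along the other.
rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((128, 64)) for _ in range(3))
scale = 1.0 / np.sqrt(64)
out_blocks = []
for q_blk in np.split(Q, 2):                                              # Q-DIM parallelism
    partials = [partial_attention(q_blk, k_blk, v_blk, scale)
                for k_blk, v_blk in zip(np.split(K, 2), np.split(V, 2))]  # KV-DIM parallelism
    out_blocks.append(merge_partials(partials))
out = np.concatenate(out_blocks)

# Reference: ordinary softmax attention over the full matrices.
s = (Q @ K.T) * scale
p = np.exp(s - s.max(axis=1, keepdims=True))
ref = (p / p.sum(axis=1, keepdims=True)) @ V
assert np.allclose(out, ref, atol=1e-6)
```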
Sources:
The original FlashAttention paper:
https://arxiv.org/pdf/2205.14135
The FlashAttention-2 paper:
https://arxiv.org/pdf/2307.08691
Microsoft's ATTENTION2D paper (June 28, 2025):
https://arxiv.org/pdf/2503.15758
Microsoft's Lean Attention paper:
https://www.microsoft.com/en-us/research/wp-content/uploads/2024/05/Lean_Attention___arxiv_version.pdf