AI Post Transformers

FlashAttention-4 Conquers Asymmetric GPU Hardware Scaling



Hal Turing and Dr. Ada Shannon dig into FlashAttention-4, a March 2026 paper from a cross-institutional team including Tri Dao, Jay Shah, and colleagues at Princeton, Meta, NVIDIA, Colfax Research, Georgia Tech, and Together AI. The paper targets a precise hardware mismatch on NVIDIA's Blackwell B200: tensor core throughput doubles compared to the H100, but shared memory bandwidth and dedicated exponential function units do not scale at the same rate. Rather than waiting for hardware fixes, the authors co-design the attention algorithm with the asymmetric architecture itself — making FlashAttention-4 the first attention kernel built specifically for Blackwell's scaling profile.
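To make the mismatch concrete, the sketch below does the back-of-the-envelope arithmetic the episode gestures at: per attention tile, the matmuls need on the order of 4 x Br x Bc x d FLOPs while the softmax needs roughly Br x Bc exponentials, so if matmul throughput doubles and exp throughput stays flat, the exponentials eat a growing share of tile time. The throughput numbers are hypothetical placeholders, not H100 or B200 specs, and the model ignores any overlap between the two units.

```python
def tile_time_fraction(matmul_tflops, exp_gops, br=128, bc=128, d=128):
    """Rough fraction of per-tile time spent on softmax exponentials (idealized, no overlap)."""
    matmul_flops = 4 * br * bc * d          # Q @ K^T plus P @ V for one Br x Bc tile
    exps = br * bc                          # one exp() per score element in the tile
    t_matmul = matmul_flops / (matmul_tflops * 1e12)
    t_exp = exps / (exp_gops * 1e9)
    return t_exp / (t_matmul + t_exp)

# Hypothetical throughputs, for illustration only:
print(f"baseline GPU:        exp share ~ {tile_time_fraction(1000, 3500):.0%}")
print(f"2x matmul, flat exp: exp share ~ {tile_time_fraction(2000, 3500):.0%}")
```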
To frame why this matters, Shannon traces the lineage of FlashAttention research. The original 2022 NeurIPS paper by Dao and colleagues reframed attention as an IO problem: instead of materializing the quadratic N×N score matrix in slow off-chip High Bandwidth Memory, tiling and online softmax keep the computation inside the fast on-chip shared memory of each streaming multiprocessor. FlashAttention-2 roughly doubled throughput with better parallelism along the sequence dimension and better work partitioning across warps. FlashAttention-3 pushed H100 utilization to roughly 75% by exploiting Hopper-specific warp specialization and asynchronous data movement. Each generation addressed a qualitatively different bottleneck, and Blackwell introduced a new one that none of those solutions anticipated.
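For readers who want the core trick in code, here is a minimal NumPy sketch of tiling with online softmax, the algorithmic idea behind the original FlashAttention. It is not the CUDA kernel: block size, shapes, and the single-head layout are arbitrary choices for illustration, and the real kernels run this per-block loop inside each SM's shared memory.

```python
import numpy as np

def attention_tiled(Q, K, V, block=128):
    """Exact attention, one K/V block at a time; the full N x N score matrix is never materialized."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)          # running (unnormalized) output
    m = np.full(N, -np.inf)       # running row-wise max, for numerical stability
    l = np.zeros(N)               # running row-wise sum of exponentials

    for start in range(0, N, block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = (Q @ Kb.T) * scale                # scores for this block only: N x block
        m_new = np.maximum(m, S.max(axis=1))
        alpha = np.exp(m - m_new)             # rescales everything accumulated so far
        P = np.exp(S - m_new[:, None])        # the exp() traffic that strains Blackwell's SFUs
        l = alpha * l + P.sum(axis=1)
        O = alpha[:, None] * O + P @ Vb
        m = m_new

    return O / l[:, None]

# Sanity check against naive softmax attention
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((512, 64)).astype(np.float32) for _ in range(3))
S = Q @ K.T / np.sqrt(64)
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(attention_tiled(Q, K, V), ref, atol=1e-3)
```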
The hosts then spell out the stakes for practitioners who work in ML without ever writing a GPU kernel. Attention sits at the core of every Transformer-based system (large language models, vision transformers, multimodal architectures), and long-context workloads at 32K to 128K tokens make the quadratic memory cost and the HBM round-trips increasingly punishing. Shannon introduces the roofline model as the analytic lens the paper uses to characterize where Blackwell kernels actually hit their bottlenecks, setting up how FlashAttention-4's co-design approach navigates compute and memory-bandwidth ceilings that previous generations of the kernel never had to contend with.
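As a quick refresher on the roofline model itself (the Williams, Waterman, and Patterson framing in the sources below): attainable throughput is the lesser of the compute roof and arithmetic intensity times memory bandwidth. The sketch below uses made-up peak numbers purely to show how the ceiling flips from memory-bound to compute-bound as intensity rises; they are not measured H100 or B200 figures.

```python
def roofline_gflops(intensity_flop_per_byte, peak_gflops, bandwidth_gb_s):
    """Attainable GFLOP/s = min(compute roof, arithmetic intensity * memory bandwidth)."""
    return min(peak_gflops, intensity_flop_per_byte * bandwidth_gb_s)

# Hypothetical roofs, for illustration only (not vendor specs):
peak, bw = 900_000.0, 8_000.0           # GFLOP/s and GB/s
for intensity in (2, 20, 200):          # FLOPs per byte moved to/from memory
    attainable = roofline_gflops(intensity, peak, bw)
    bound = "compute-bound" if attainable >= peak else "memory-bound"
    print(f"intensity {intensity:>3} FLOP/B -> {attainable:>9,.0f} GFLOP/s ({bound})")
```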
Sources:
1. FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling — Ted Zadouri, Markus Hoehnerbach, Jay Shah, Timmy Liu, Vijay Thakkar, Tri Dao, 2026
http://arxiv.org/abs/2603.05451v1
2. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness — Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré, 2022
https://scholar.google.com/scholar?q=FlashAttention:+Fast+and+Memory-Efficient+Exact+Attention+with+IO-Awareness
3. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning — Tri Dao, 2023
https://scholar.google.com/scholar?q=FlashAttention-2:+Faster+Attention+with+Better+Parallelism+and+Work+Partitioning
4. FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low Precision — Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao, 2024
https://scholar.google.com/scholar?q=FlashAttention-3:+Fast+and+Accurate+Attention+with+Asynchrony+and+Low+Precision
5. Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations — Philippe Tillet, H. T. Kung, David Cox, 2019
https://scholar.google.com/scholar?q=Triton:+An+Intermediate+Language+and+Compiler+for+Tiled+Neural+Network+Computations
6. Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures — Samuel Williams, Andrew Waterman, David Patterson, 2009
https://scholar.google.com/scholar?q=Roofline:+An+Insightful+Visual+Performance+Model+for+Floating-Point+Programs+and+Multicore+Architectures
7. Efficiently Scaling Transformer Inference — Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Sharan Narang, Jeff Dean, 2023
https://scholar.google.com/scholar?q=Efficiently+Scaling+Transformer+Inference
8. In-Datacenter Performance Analysis of a Tensor Processing Unit — Norman P. Jouppi et al. (Google), 2017
https://scholar.google.com/scholar?q=In-Datacenter+Performance+Analysis+of+a+Tensor+Processing+Unit
9. Mamba: Linear-Time Sequence Modeling with Selective State Spaces — Albert Gu, Tri Dao, 2023
https://scholar.google.com/scholar?q=Mamba:+Linear-Time+Sequence+Modeling+with+Selective+State+Spaces
10. Efficient Memory Management for Large Language Model Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, Ion Stoica, 2023
https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention
11. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism — Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro, 2019
https://scholar.google.com/scholar?q=Megatron-LM:+Training+Multi-Billion+Parameter+Language+Models+Using+Model+Parallelism
12. SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization — Jintao Zhang et al., 2024
https://scholar.google.com/scholar?q=SageAttention2:+Efficient+Attention+with+Thorough+Outlier+Smoothing+and+Per-thread+INT4+Quantization
13. Ring Attention with Blockwise Transformers for Near-Infinite Context — Hao Liu, Matei Zaharia, Pieter Abbeel, 2023
https://scholar.google.com/scholar?q=Ring+Attention+with+Blockwise+Transformers+for+Near-Infinite+Context
14. FlexAttention: The Flexibility of PyTorch with the Performance of FlashAttention — PyTorch Team, 2024
https://scholar.google.com/scholar?q=FlexAttention:+The+Flexibility+of+PyTorch+with+the+Performance+of+FlashAttention
15. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model — DeepSeek-AI, 2024
https://scholar.google.com/scholar?q=DeepSeek-V2:+A+Strong,+Economical,+and+Efficient+Mixture-of-Experts+Language+Model
16. Softmax output approximation for activation memory-efficient training of attention-based networks — c. 2023–2024 (approximate)
https://scholar.google.com/scholar?q=Softmax+output+approximation+for+activation+memory-efficient+training+of+attention-based+networks
17. FlashAttention-T: Towards Fully Tensorized Attention by Exploiting Tensor-Vector Parallelism — c. 2024–2025 (approximate)
https://scholar.google.com/scholar?q=FlashAttention-T:+Towards+Fully+Tensorized+Attention+by+Exploiting+Tensor-Vector+Parallelism
18. Overcoming Long-Context Limitations of State-Space Models via Context-Dependent Sparse Attention — c. 2024–2025 (approximate)
https://scholar.google.com/scholar?q=Overcoming+Long-Context+Limitations+of+State-Space+Models+via+Context-Dependent+Sparse+Attention
19. Based on Tensor Core Sparse Kernels Accelerating Deep Neural Networks — c. 2024–2025 (approximate)
https://scholar.google.com/scholar?q=Based+on+Tensor+Core+Sparse+Kernels+Accelerating+Deep+Neural+Networks
20. AI Post Transformers: FlashAttention-2: Faster Attention with Better Parallelism — Hal Turing & Dr. Ada Shannon
https://podcasters.spotify.com/pod/show/12146088098/episodes/FlashAttention-2-Faster-Attention-with-Better-Parallelism-e36kdm0
21. AI Post Transformers: ATTENTION2D and lean attention: Distributed Self-Attention — Hal Turing & Dr. Ada Shannon
https://podcasters.spotify.com/pod/show/12146088098/episodes/ATTENTION2D-and-lean-attention-Distributed-Self-Attention-e3a7r4n
22. AI Post Transformers: Jet-RL: Stable On-Policy Reinforcement Learning with Unified FP8 Flow — Hal Turing & Dr. Ada Shannon
https://podcasters.spotify.com/pod/show/12146088098/episodes/Jet-RL-Stable-On-Policy-Reinforcement-Learning-with-Unified-FP8-Flow-e3f7det
23. AI Post Transformers: Mojo: Performance-Portable HPC Kernels on GPUs — Hal Turing & Dr. Ada Shannon
https://podcasters.spotify.com/pod/show/12146088098/episodes/Mojo-Performance-Portable-HPC-Kernels-on-GPUs-e39n4sk
Interactive Visualization: FlashAttention-4: Algorithm & Kernel Co-Design
AI Post Transformers, by mcgrof