AI Post Transformers

FlashAttention-4 Conquers Asymmetric GPU Hardware Scaling



Hal Turing and Dr. Ada Shannon dig into FlashAttention-4, a March 2026 paper from a cross-institutional team including Tri Dao, Jay Shah, and colleagues at Princeton, Meta, NVIDIA, Colfax Research, Georgia Tech, and Together AI. The paper targets a precise hardware mismatch on NVIDIA's Blackwell B200: tensor core throughput doubles compared to the H100, but shared memory bandwidth and dedicated exponential function units do not scale at the same rate. Rather than waiting for hardware fixes, the authors co-design the attention algorithm with the asymmetric architecture itself — making FlashAttention-4 the first attention kernel built specifically for Blackwell's scaling profile.
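To make the mismatch concrete, the sketch below does the back-of-the-envelope arithmetic the episode gestures at: per attention tile, the matmuls need on the order of 4 x Br x Bc x d FLOPs while the softmax needs roughly Br x Bc exponentials, so if matmul throughput doubles and exp throughput stays flat, the exponentials eat a growing share of tile time. The throughput numbers are hypothetical placeholders, not H100 or B200 specs, and the model ignores any overlap between the two units.

```python
def tile_time_fraction(matmul_tflops, exp_gops, br=128, bc=128, d=128):
    """Rough fraction of per-tile time spent on softmax exponentials (idealized, no overlap)."""
    matmul_flops = 4 * br * bc * d          # Q @ K^T plus P @ V for one Br x Bc tile
    exps = br * bc                          # one exp() per score element in the tile
    t_matmul = matmul_flops / (matmul_tflops * 1e12)
    t_exp = exps / (exp_gops * 1e9)
    return t_exp / (t_matmul + t_exp)

# Hypothetical throughputs, for illustration only:
print(f"baseline GPU:        exp share ~ {tile_time_fraction(1000, 3500):.0%}")
print(f"2x matmul, flat exp: exp share ~ {tile_time_fraction(2000, 3500):.0%}")
```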
To frame why this matters, Shannon traces the lineage of FlashAttention research. The original 2022 NeurIPS paper by Dao and colleagues reframed attention as an IO problem: instead of materializing the quadratic N×N score matrix in slow off-chip High Bandwidth Memory, tiling and online softmax keep the computation inside the fast on-chip shared memory of each streaming multiprocessor. FlashAttention-2 roughly doubled throughput with better parallelism along the sequence dimension and better work partitioning across warps. FlashAttention-3 pushed H100 utilization to roughly 75% by exploiting Hopper-specific warp specialization and asynchronous data movement. Each generation addressed a qualitatively different bottleneck, and Blackwell introduced a new one that none of those solutions anticipated.
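For readers who want the core trick in code, here is a minimal NumPy sketch of tiling with online softmax, the algorithmic idea behind the original FlashAttention. It is not the CUDA kernel: block size, shapes, and the single-head layout are arbitrary choices for illustration, and the real kernels run this per-block loop inside each SM's shared memory.

```python
import numpy as np

def attention_tiled(Q, K, V, block=128):
    """Exact attention, one K/V block at a time; the full N x N score matrix is never materialized."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)          # running (unnormalized) output
    m = np.full(N, -np.inf)       # running row-wise max, for numerical stability
    l = np.zeros(N)               # running row-wise sum of exponentials

    for start in range(0, N, block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = (Q @ Kb.T) * scale                # scores for this block only: N x block
        m_new = np.maximum(m, S.max(axis=1))
        alpha = np.exp(m - m_new)             # rescales everything accumulated so far
        P = np.exp(S - m_new[:, None])        # the exp() traffic that strains Blackwell's SFUs
        l = alpha * l + P.sum(axis=1)
        O = alpha[:, None] * O + P @ Vb
        m = m_new

    return O / l[:, None]

# Sanity check against naive softmax attention
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((512, 64)).astype(np.float32) for _ in range(3))
S = Q @ K.T / np.sqrt(64)
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(attention_tiled(Q, K, V), ref, atol=1e-3)
```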
The hosts then spell out the stakes for practitioners who work in ML without ever writing a GPU kernel. Attention sits at the core of every Transformer-based system (large language models, vision transformers, multimodal architectures), and long-context workloads at 32K to 128K tokens make the quadratic memory cost and the HBM round-trips increasingly punishing. Shannon introduces the roofline model as the analytic lens the paper uses to characterize where Blackwell kernels actually hit their bottlenecks, setting up how FlashAttention-4's co-design approach navigates compute and memory-bandwidth ceilings that previous generations of the kernel never had to contend with.
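As a quick refresher on the roofline model itself (the Williams, Waterman, and Patterson framing in the sources below): attainable throughput is the lesser of the compute roof and arithmetic intensity times memory bandwidth. The sketch below uses made-up peak numbers purely to show how the ceiling flips from memory-bound to compute-bound as intensity rises; they are not measured H100 or B200 figures.

```python
def roofline_gflops(intensity_flop_per_byte, peak_gflops, bandwidth_gb_s):
    """Attainable GFLOP/s = min(compute roof, arithmetic intensity * memory bandwidth)."""
    return min(peak_gflops, intensity_flop_per_byte * bandwidth_gb_s)

# Hypothetical roofs, for illustration only (not vendor specs):
peak, bw = 900_000.0, 8_000.0           # GFLOP/s and GB/s
for intensity in (2, 20, 200):          # FLOPs per byte moved to/from memory
    attainable = roofline_gflops(intensity, peak, bw)
    bound = "compute-bound" if attainable >= peak else "memory-bound"
    print(f"intensity {intensity:>3} FLOP/B -> {attainable:>9,.0f} GFLOP/s ({bound})")
```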
Sources:
1. FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling — Ted Zadouri, Markus Hoehnerbach, Jay Shah, Timmy Liu, Vijay Thakkar, Tri Dao, 2026
http://arxiv.org/abs/2603.05451v1
2. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness — Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré, 2022
https://scholar.google.com/scholar?q=FlashAttention:+Fast+and+Memory-Efficient+Exact+Attention+with+IO-Awareness
3. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning — Tri Dao, 2023
https://scholar.google.com/scholar?q=FlashAttention-2:+Faster+Attention+with+Better+Parallelism+and+Work+Partitioning
4. FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low Precision — Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao, 2024
https://scholar.google.com/scholar?q=FlashAttention-3:+Fast+and+Accurate+Attention+with+Asynchrony+and+Low+Precision
5. Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations — Philippe Tillet, H. T. Kung, David Cox, 2019
https://scholar.google.com/scholar?q=Triton:+An+Intermediate+Language+and+Compiler+for+Tiled+Neural+Network+Computations
6. Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures — Samuel Williams, Andrew Waterman, David Patterson, 2009
https://scholar.google.com/scholar?q=Roofline:+An+Insightful+Visual+Performance+Model+for+Floating-Point+Programs+and+Multicore+Architectures
7. Efficiently Scaling Transformer Inference — Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Sharan Narang, Jeff Dean, 2023
https://scholar.google.com/scholar?q=Efficiently+Scaling+Transformer+Inference
8. In-Datacenter Performance Analysis of a Tensor Processing Unit — Norman P. Jouppi et al. (Google), 2017
https://scholar.google.com/scholar?q=In-Datacenter+Performance+Analysis+of+a+Tensor+Processing+Unit
9. Mamba: Linear-Time Sequence Modeling with Selective State Spaces — Albert Gu, Tri Dao, 2023
https://scholar.google.com/scholar?q=Mamba:+Linear-Time+Sequence+Modeling+with+Selective+State+Spaces
10. Efficient Memory Management for Large Language Model Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, Ion Stoica, 2023
https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention
11. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism — Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro, 2019
https://scholar.google.com/scholar?q=Megatron-LM:+Training+Multi-Billion+Parameter+Language+Models+Using+Model+Parallelism
12. SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization — Jintao Zhang et al., 2024
https://scholar.google.com/scholar?q=SageAttention2:+Efficient+Attention+with+Thorough+Outlier+Smoothing+and+Per-thread+INT4+Quantization
13. Ring Attention with Blockwise Transformers for Near-Infinite Context — Hao Liu, Matei Zaharia, Pieter Abbeel, 2023
https://scholar.google.com/scholar?q=Ring+Attention+with+Blockwise+Transformers+for+Near-Infinite+Context
14. FlexAttention: The Flexibility of PyTorch with the Performance of FlashAttention — PyTorch Team, 2024
https://scholar.google.com/scholar?q=FlexAttention:+The+Flexibility+of+PyTorch+with+the+Performance+of+FlashAttention
15. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model — DeepSeek-AI, 2024
https://scholar.google.com/scholar?q=DeepSeek-V2:+A+Strong,+Economical,+and+Efficient+Mixture-of-Experts+Language+Model
16. Softmax output approximation for activation memory-efficient training of attention-based networks — c. 2023–2024 (approximate)
https://scholar.google.com/scholar?q=Softmax+output+approximation+for+activation+memory-efficient+training+of+attention-based+networks
17. FlashAttention-T: Towards Fully Tensorized Attention by Exploiting Tensor-Vector Parallelism — c. 2024–2025 (approximate)
https://scholar.google.com/scholar?q=FlashAttention-T:+Towards+Fully+Tensorized+Attention+by+Exploiting+Tensor-Vector+Parallelism
18. Overcoming Long-Context Limitations of State-Space Models via Context-Dependent Sparse Attention — c. 2024–2025 (approximate)
https://scholar.google.com/scholar?q=Overcoming+Long-Context+Limitations+of+State-Space+Models+via+Context-Dependent+Sparse+Attention
19. Based on Tensor Core Sparse Kernels Accelerating Deep Neural Networks — c. 2024–2025 (approximate)
https://scholar.google.com/scholar?q=Based+on+Tensor+Core+Sparse+Kernels+Accelerating+Deep+Neural+Networks
20. AI Post Transformers: FlashAttention-2: Faster Attention with Better Parallelism — Hal Turing & Dr. Ada Shannon
https://podcasters.spotify.com/pod/show/12146088098/episodes/FlashAttention-2-Faster-Attention-with-Better-Parallelism-e36kdm0
21. AI Post Transformers: ATTENTION2D and lean attention: Distributed Self-Attention — Hal Turing & Dr. Ada Shannon
https://podcasters.spotify.com/pod/show/12146088098/episodes/ATTENTION2D-and-lean-attention-Distributed-Self-Attention-e3a7r4n
22. AI Post Transformers: Jet-RL: Stable On-Policy Reinforcement Learning with Unified FP8 Flow — Hal Turing & Dr. Ada Shannon
https://podcasters.spotify.com/pod/show/12146088098/episodes/Jet-RL-Stable-On-Policy-Reinforcement-Learning-with-Unified-FP8-Flow-e3f7det
23. AI Post Transformers: Mojo: Performance-Portable HPC Kernels on GPUs — Hal Turing & Dr. Ada Shannon
https://podcasters.spotify.com/pod/show/12146088098/episodes/Mojo-Performance-Portable-HPC-Kernels-on-GPUs-e39n4sk
Interactive Visualization: FlashAttention-4: Algorithm & Kernel Co-Design
AI Post Transformers, by mcgrof