The podcast will dive deep into the featured paper: "FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling".
Here are some of the key concepts the hosts will explore:
- The Shift to Blackwell GPUs: As the AI industry rapidly transitions to Blackwell-based systems like the B200 and GB200, hardware scaling has become highly asymmetric. While tensor core throughput has doubled compared to the Hopper H100 architecture, other resources, such as shared memory bandwidth and the special function units that compute exponentials, have scaled more slowly, shifting the computational bottlenecks.
- Algorithmic and Kernel Co-Design: To address these new bottlenecks, FlashAttention-4 introduces redesigned pipelines that exploit fully asynchronous matrix multiply-accumulate (MMA) operations to maximize overlap among tensor core work, softmax computation, and data movement.
- Mitigating Bottlenecks: The hosts will discuss innovative solutions like software-emulated exponential functions using polynomial approximation to increase exponential throughput, as well as the use of the new 256 KB tensor memory (TMEM) and 2-CTA MMA modes to significantly reduce shared memory traffic.
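The exponential-emulation idea can be illustrated with a minimal sketch: range-reduce x so that exp(x) becomes 2^n · 2^f with f in [0, 1), then evaluate a short polynomial for the fractional part. The degree and coefficients below are plain Taylor terms chosen for illustration, not FlashAttention-4's actual minimax polynomial, and this is Python rather than CUDA device code:

```python
import math

def exp_poly(x: float) -> float:
    """Approximate exp(x) via range reduction plus a degree-4 polynomial.

    Illustrative sketch of software-emulated exponentials; the real kernel
    would use hardware-tuned minimax coefficients, not these Taylor terms.
    """
    # exp(x) = 2^(x * log2(e)); split the exponent into integer + fraction.
    t = x * 1.4426950408889634      # log2(e)
    n = math.floor(t)
    f = t - n                        # fractional part, in [0, 1)
    # Evaluate 2^f = exp(f * ln 2) with a degree-4 polynomial (Horner form).
    u = f * 0.6931471805599453      # ln 2
    p = 1.0 + u * (1.0 + u * (0.5 + u * (1.0 / 6.0 + u * (1.0 / 24.0))))
    return math.ldexp(p, n)          # scale by 2^n exactly
```

The key design point is that the polynomial evaluation runs on the ordinary fused multiply-add units, relieving pressure on the slower special function units.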
- Performance Gains: The podcast will highlight FlashAttention-4's impressive results, including achieving up to 1613 TFLOP/s (71% utilization) and delivering up to a 1.3× speedup over cuDNN 9.13 and a 2.7× speedup over Triton on B200 GPUs.
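As a quick sanity check on those numbers, the achieved throughput and the utilization figure together imply the peak rate being measured against; the implied peak is an inference from the reported figures, not a value quoted in the episode notes:

```python
# Values taken from the reported results above.
achieved_tflops = 1613   # FlashAttention-4 throughput on B200
utilization = 0.71       # reported fraction of peak

# Implied peak throughput (an inference, not a quoted figure).
implied_peak = achieved_tflops / utilization
print(round(implied_peak))   # ≈ 2272 TFLOP/s
```

That implied peak of roughly 2.27 PFLOP/s is consistent with the dense low-precision tensor core rates published for B200-class GPUs.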