Mechanical Dreams

FlashAttention-4

In this episode:
• Welcome to the Hardware Lottery: Professor Norris and Linda introduce the episode's focus: FlashAttention-4. They set the stage by discussing the arrival of NVIDIA's Blackwell architecture and why existing optimization techniques suddenly hit a wall.
• The Asymmetry Problem: Linda explains the 'Asymmetric Hardware Scaling' of the B200 GPUs, where tensor core throughput doubled but memory bandwidth and the special function units did not keep pace. Norris questions why simply running FlashAttention-3 on the new hardware isn't good enough.
• Bottlenecks in the Forward Pass: The duo dives into the algorithmic changes for the forward pass, specifically how the paper mitigates the 'exponential unit' bottleneck by emulating exponential functions on FMA units and using conditional softmax rescaling.
• Taming the Backward Pass with TMEM: A deep dive into the backward pass optimizations. Linda explains the use of Tensor Memory (TMEM) and the '2-CTA MMA' mode to reduce shared memory traffic, satisfying Norris's curiosity about how to hide latency.
• Escaping Template Hell: They discuss the implementation framework: CuTe-DSL embedded in Python. Norris rejoices at the reduction in compile times compared to C++ templates, while Linda highlights the flexibility for researchers.
• The Verdict: The hosts wrap up the findings, noting the impressive speedups over cuDNN and Triton, and offer final thoughts on the future of hardware-aware algorithm design.
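The two forward-pass ideas from the episode can be sketched in plain Python. This is a hedged illustration of the general techniques, not the paper's CUDA kernels: the function names, polynomial coefficients, and block shapes below are assumptions. It shows a software exp built entirely from multiply-adds, used inside an online softmax that only pays for a rescale when a block actually raises the running max.

```python
import math

def exp_fma(x: float) -> float:
    """exp(x) built from multiply-adds: exp(x) = 2^n * 2^f.

    Illustrative only; a real kernel would use minimax-fit
    coefficients, not this truncated Taylor series for 2^f.
    """
    t = x * 1.4426950408889634      # x * log2(e)
    n = math.floor(t)
    f = t - n                       # fractional part in [0, 1)
    # Horner evaluation: each step maps onto one fused multiply-add.
    p = 0.0555041
    p = p * f + 0.2402265           # FMA
    p = p * f + 0.6931472           # FMA
    p = p * f + 1.0                 # FMA
    return math.ldexp(p, n)         # exact scaling by 2^n

def online_softmax_denominator(score_blocks):
    """Streaming softmax statistics with *conditional* rescaling:
    the exp correction of the running sum is skipped whenever a
    block does not raise the running max, saving exponential work."""
    m, l = -math.inf, 0.0           # running max and running sum
    rescales = 0
    for block in score_blocks:
        block_max = max(block)
        if block_max > m:           # rescale only when the max grows
            if l > 0.0:
                l *= exp_fma(m - block_max)
                rescales += 1
            m = block_max
        l += sum(exp_fma(s - m) for s in block)
    return m, l, rescales
```

In this sketch, any block whose maximum does not exceed the running max is accumulated without touching the existing sum, which is exactly the case the conditional check exploits.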

Mechanical Dreams, by Mechanical Dirk