Learning GenAI via SOTA Papers

EP113: How FlashAttention-3 Doubles H100 Speed


FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision addresses the core computational bottleneck of the attention mechanism in Transformer models, specifically targeting the underutilization of newer hardware like the Hopper H100 GPU. While its predecessor, FlashAttention-2, achieved only 35% utilization on the H100, FlashAttention-3 introduces three main techniques to significantly boost performance:

  • Producer-Consumer Asynchrony: A warp-specialized software pipelining scheme splits warps into producers (which move data) and consumers (which compute), overlapping data movement with computation to hide memory and instruction-issue latencies.
  • Hiding Softmax: The algorithm uses pingpong scheduling and a 2-stage pipeline to interleave comparatively slower, non-matmul operations (like softmax) with asynchronous matrix multiplications (GEMMs).
  • Hardware-Accelerated Low-Precision (FP8): It adapts the forward pass algorithm to leverage FP8 Tensor Cores. To combat the higher numerical error and outlier features typically associated with lower-precision FP8, it employs block quantization (scaling per block rather than per tensor) and incoherent processing (multiplying queries and keys with a random orthogonal matrix to spread out outlier values).
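
The incoherent-processing trick above can be sketched in a few lines of NumPy. The demo below builds a random-sign Hadamard matrix (the orthogonal transform the paper uses) and shows the two properties that make the trick work: attention scores are unchanged, while an outlier entry gets spread across dimensions, shrinking the dynamic range an FP8 scale factor must cover. The shapes, seed, and outlier value are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

# Toy Q, K with one explicit outlier entry, mimicking the outlier
# features that hurt FP8 quantization. Sizes/values are illustrative.
Q = rng.standard_normal((4, d))
K = rng.standard_normal((4, d))
Q[0, 0] = 100.0  # outlier

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

# Random-sign Hadamard transform: orthogonal, with O(d log d) apply cost
# in a real implementation (here applied as a dense matmul for clarity).
signs = np.diag(rng.choice([-1.0, 1.0], size=d))
M = signs @ hadamard(d) / np.sqrt(d)  # M @ M.T = I

# Property 1: scores are unchanged, since (QM)(KM)^T = Q M M^T K^T = Q K^T.
assert np.allclose((Q @ M) @ (K @ M).T, Q @ K.T)

# Property 2: the rotation spreads the outlier over all d dimensions,
# so the max |value| that one quantization scale must cover shrinks.
assert np.abs(Q @ M).max() < np.abs(Q).max()
print(np.abs(Q).max(), np.abs(Q @ M).max())
```

Because every entry of the scaled Hadamard matrix has magnitude 1/sqrt(d), a single huge coordinate gets diluted by roughly that factor after rotation, which is exactly why block-wise FP8 scales lose less precision post-rotation.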

Key Results: FlashAttention-3 achieves a 1.5x to 2.0x speedup over FlashAttention-2 on H100 GPUs. In FP16, it reaches up to 740 TFLOPs/s (75% hardware utilization), and with FP8, it reaches close to 1.2 PFLOPs/s. Furthermore, thanks to its error-mitigation techniques, the FP8 implementation achieves 2.6x lower numerical error than a baseline FP8 attention.
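
As a quick sanity check on those utilization figures, we can divide the reported throughput by the H100's peak dense Tensor Core rates. The peak numbers below are NVIDIA's vendor-quoted figures for the H100 SXM (without sparsity), which are assumptions not stated in the episode itself:

```python
# H100 SXM dense Tensor Core peaks, no sparsity (vendor-quoted figures;
# an assumption here, not stated in the episode).
FP16_PEAK_TFLOPS = 989.0
FP8_PEAK_TFLOPS = 1979.0

fp16_util = 740.0 / FP16_PEAK_TFLOPS    # reported FP16 throughput
fp8_util = 1200.0 / FP8_PEAK_TFLOPS     # ~1.2 PFLOPs/s in FP8

print(f"FP16 utilization: {fp16_util:.0%}")  # ≈ 75%, matching the paper
print(f"FP8 utilization:  {fp8_util:.0%}")
```

The FP16 ratio lands right at the 75% utilization the paper reports; the FP8 figure is lower relative to its (doubled) peak, reflecting the extra quantization and layout work the FP8 path must do.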


Learning GenAI via SOTA Papers, by Yun Wu