Learning GenAI via SOTA Papers

EP113: How FlashAttention-3 Doubles H100 Speed


FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision addresses the core computational bottleneck of the attention mechanism in Transformer models, specifically targeting the underutilization of newer hardware like the Hopper H100 GPU. While its predecessor, FlashAttention-2, achieved only 35% utilization on the H100, FlashAttention-3 introduces three main techniques to significantly boost performance:

  • Producer-Consumer Asynchrony: A warp-specialized software pipelining scheme splits warps into producers (which move data) and consumers (which compute), overlapping data movement with computation to hide memory and instruction-issue latencies.
  • Hiding Softmax: The algorithm uses pingpong scheduling and a 2-stage pipeline to interleave comparatively slower, non-matmul operations (like softmax) with asynchronous matrix multiplications (GEMMs).
  • Hardware-Accelerated Low-Precision (FP8): It adapts the forward pass algorithm to leverage FP8 Tensor Cores. To combat the higher numerical error and outlier features typically associated with lower-precision FP8, it employs block quantization (scaling per block rather than per tensor) and incoherent processing (multiplying queries and keys with a random orthogonal matrix to spread out outlier values).
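
The incoherent-processing trick above can be sketched in a few lines of NumPy. The demo below builds a random-sign Hadamard matrix (the orthogonal transform the paper uses) and shows the two properties that make the trick work: attention scores are unchanged, while an outlier entry gets spread across dimensions, shrinking the dynamic range an FP8 scale factor must cover. The shapes, seed, and outlier value are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

# Toy Q, K with one explicit outlier entry, mimicking the outlier
# features that hurt FP8 quantization. Sizes/values are illustrative.
Q = rng.standard_normal((4, d))
K = rng.standard_normal((4, d))
Q[0, 0] = 100.0  # outlier

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

# Random-sign Hadamard transform: orthogonal, with O(d log d) apply cost
# in a real implementation (here applied as a dense matmul for clarity).
signs = np.diag(rng.choice([-1.0, 1.0], size=d))
M = signs @ hadamard(d) / np.sqrt(d)  # M @ M.T = I

# Property 1: scores are unchanged, since (QM)(KM)^T = Q M M^T K^T = Q K^T.
assert np.allclose((Q @ M) @ (K @ M).T, Q @ K.T)

# Property 2: the rotation spreads the outlier over all d dimensions,
# so the max |value| that one quantization scale must cover shrinks.
assert np.abs(Q @ M).max() < np.abs(Q).max()
print(np.abs(Q).max(), np.abs(Q @ M).max())
```

Because every entry of the scaled Hadamard matrix has magnitude 1/sqrt(d), a single huge coordinate gets diluted by roughly that factor after rotation, which is exactly why block-wise FP8 scales lose less precision post-rotation.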

Key Results: FlashAttention-3 achieves a 1.5x to 2.0x speedup over FlashAttention-2 on H100 GPUs. In FP16, it reaches up to 740 TFLOPs/s (75% hardware utilization), and with FP8, it reaches close to 1.2 PFLOPs/s. Furthermore, thanks to its error-mitigation techniques, the FP8 implementation achieves 2.6x lower numerical error than a baseline FP8 attention.
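
As a quick sanity check on those utilization figures, we can divide the reported throughput by the H100's peak dense Tensor Core rates. The peak numbers below are NVIDIA's vendor-quoted figures for the H100 SXM (without sparsity), which are assumptions not stated in the episode itself:

```python
# H100 SXM dense Tensor Core peaks, no sparsity (vendor-quoted figures;
# an assumption here, not stated in the episode).
FP16_PEAK_TFLOPS = 989.0
FP8_PEAK_TFLOPS = 1979.0

fp16_util = 740.0 / FP16_PEAK_TFLOPS    # reported FP16 throughput
fp8_util = 1200.0 / FP8_PEAK_TFLOPS     # ~1.2 PFLOPs/s in FP8

print(f"FP16 utilization: {fp16_util:.0%}")  # ≈ 75%, matching the paper
print(f"FP8 utilization:  {fp8_util:.0%}")
```

The FP16 ratio lands right at the 75% utilization the paper reports; the FP8 figure is lower relative to its (doubled) peak, reflecting the extra quantization and layout work the FP8 path must do.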


Learning GenAI via SOTA Papers, by Yun Wu