


Discusses FlashAttention, an IO-aware algorithm designed to optimize the attention mechanism in Large Language Models (LLMs). It explains how standard attention suffers from quadratic complexity and becomes a memory bottleneck on GPUs due to excessive data transfers between slow HBM and fast SRAM.
FlashAttention addresses this by employing techniques like tiling, kernel fusion, online softmax, and recomputation to significantly reduce memory usage (achieving linear scaling) and increase speed, enabling LLMs to handle much longer sequences.
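As a rough illustration of the tiling and online-softmax ideas mentioned above, the sketch below computes attention for a single query vector block by block in NumPy, keeping only a running max, a running normalizer, and a running output instead of the full score vector. This is a minimal CPU-side analogy, not the fused GPU kernel FlashAttention actually implements; all names and the block size are illustrative.

```python
import numpy as np

def naive_attention(q, K, V):
    # Standard attention for a single query: materializes all N scores at once.
    s = K @ q                         # (N,) attention scores
    p = np.exp(s - s.max())
    return (p / p.sum()) @ V          # softmax-weighted sum of values

def online_attention(q, K, V, block=4):
    # Online-softmax attention: process key/value blocks one at a time,
    # carrying only a running max, running denominator, and running output.
    m = -np.inf                       # running max of scores seen so far
    l = 0.0                           # running softmax denominator
    o = np.zeros(V.shape[1])          # running (unnormalized) output
    for i in range(0, K.shape[0], block):
        s = K[i:i + block] @ q                 # scores for this block only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)              # rescale earlier partial results
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        o = o * scale + p @ V[i:i + block]
        m = m_new
    return o / l

rng = np.random.default_rng(0)
q, K, V = rng.normal(size=(8,)), rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
assert np.allclose(naive_attention(q, K, V), online_attention(q, K, V))
```

Both functions return the same result, but the online version never holds more than one block of scores at a time, which is the property FlashAttention exploits to keep each tile in fast SRAM.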
The text also covers the evolution through FlashAttention-2 and FlashAttention-3, which leverage enhanced parallelism and new hardware features, as well as various specialized variants and the widespread integration into popular frameworks like PyTorch and Hugging Face.
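As an example of that framework integration, the snippet below uses PyTorch's built-in `scaled_dot_product_attention` (available since PyTorch 2.0), which can dispatch to a FlashAttention-style fused kernel on supported GPUs with half-precision inputs. The shapes, device handling, and dtype choices are illustrative, not taken from the episode.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: (batch, heads, sequence length, head dim).
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
q, k, v = (torch.randn(2, 8, 1024, 64, device=device, dtype=dtype) for _ in range(3))

# PyTorch selects a fused attention backend automatically; on supported GPUs with
# half-precision inputs this can dispatch to a FlashAttention-style kernel, so the
# full 1024x1024 attention matrix is never materialized in GPU HBM.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 1024, 64])
```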
By Benjamin Alloul