


The paper introduces FlashAttention, a new algorithm designed to address the slow and memory-intensive nature of Transformer models when processing long sequences. Standard self-attention has a time and memory complexity that scales quadratically with sequence length, largely due to the massive overhead of reading and writing the large intermediate attention matrix to the GPU's relatively slow High Bandwidth Memory (HBM). While prior approximate attention methods tried to reduce compute requirements, they often failed to achieve actual wall-clock speedups because they ignored these memory access (IO) overheads.
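To make the quadratic intermediate concrete, here is a small NumPy illustration of standard attention (the sizes are arbitrary, chosen only for the example):

```python
import numpy as np

# Illustrative sizes: sequence length N, head dimension d.
N, d = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))

# Standard attention materializes an N x N score matrix S and its softmax P.
S = Q @ K.T / np.sqrt(d)                      # shape (N, N): quadratic in N
P = np.exp(S - S.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)
O = P @ V                                     # output is only (N, d)

# The N x N intermediate dwarfs the (N, d) output as N grows,
# and in a GPU kernel it is read from and written to slow HBM.
print(S.nbytes, O.nbytes)  # 8388608 vs 524288 bytes at float64
```

At N = 64K the score matrix alone would occupy tens of gigabytes, which is why the HBM traffic for these intermediates dominates the runtime.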
To solve this, FlashAttention implements an IO-aware exact attention algorithm that drastically reduces HBM accesses using two key techniques: (1) tiling, which splits the inputs into blocks that fit in fast on-chip SRAM and incrementally combines their softmax statistics, so the full attention matrix is never written to HBM; and (2) recomputation, which regenerates the needed attention blocks on the fly during the backward pass instead of storing them.
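The tiling idea can be sketched in NumPy (this is a reference sketch of the block-by-block "online softmax" recurrence, not the fused CUDA kernel; the block size is an assumed illustration parameter):

```python
import numpy as np

def flash_attention_ref(Q, K, V, block=128):
    """Tiled exact attention: walks over K/V in blocks, keeping only a
    running row maximum m and normalizer l per query, so no full N x N
    score matrix is ever materialized."""
    N, d = Q.shape
    O = np.zeros((N, d))
    m = np.full(N, -np.inf)              # running row maxima
    l = np.zeros(N)                      # running softmax denominators
    for j in range(0, N, block):
        Kj, Vj = K[j:j+block], V[j:j+block]
        S = Q @ Kj.T / np.sqrt(d)        # (N, block): only one block of scores
        m_new = np.maximum(m, S.max(axis=1))
        p = np.exp(S - m_new[:, None])
        c = np.exp(m - m_new)            # rescale previous partial results
        l = c * l + p.sum(axis=1)
        O = c[:, None] * O + p @ Vj
        m = m_new
    return O / l[:, None]

# The tiled result matches standard softmax attention exactly:
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 32)) for _ in range(3))
S = Q @ K.T / np.sqrt(32)
P = np.exp(S - S.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)
assert np.allclose(flash_attention_ref(Q, K, V, block=64), P @ V)
```

Because each block's contribution can be rescaled after the fact, the algorithm stays exact while touching each score block only once in fast memory.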
The authors also propose block-sparse FlashAttention, an extension that skips zero blocks in a sparse attention mask. This approximate attention algorithm further improves speed and reduces IO complexity by a factor proportional to the sparsity ratio.
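A minimal sketch of the block-sparse variant, under the same assumed NumPy setting (here `block_mask` is a hypothetical boolean matrix over query/key blocks, with at least one nonzero block per query row):

```python
import numpy as np

def block_sparse_attention(Q, K, V, block_mask, block=64):
    """Tiled attention that skips key/value blocks the sparsity mask
    zeroes out entirely: skipped blocks are never read or computed, so
    work and IO scale with the fraction of nonzero blocks."""
    N, d = Q.shape
    O = np.zeros((N, d))
    for i in range(0, N, block):
        Qi = Q[i:i+block]
        m = np.full(len(Qi), -np.inf)
        l = np.zeros(len(Qi))
        Oi = np.zeros_like(Qi)
        for j in range(0, N, block):
            if not block_mask[i // block, j // block]:
                continue                 # zero block: no memory traffic at all
            S = Qi @ K[j:j+block].T / np.sqrt(d)
            m_new = np.maximum(m, S.max(axis=1))
            p = np.exp(S - m_new[:, None])
            c = np.exp(m - m_new)
            l = c * l + p.sum(axis=1)
            Oi = c[:, None] * Oi + p @ V[j:j+block]
            m = m_new
        O[i:i+block] = Oi / l[:, None]
    return O

# With an all-True mask it reduces to dense exact attention.
rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((128, 16)) for _ in range(3))
mask = np.ones((2, 2), dtype=bool)
S = Q @ K.T / np.sqrt(16)
P = np.exp(S - S.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)
assert np.allclose(block_sparse_attention(Q, K, V, mask, block=64), P @ V)
```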
Key Results & Impact: FlashAttention trains BERT-large roughly 15% faster than the MLPerf 1.1 record and GPT-2 up to 3x faster than common baselines, and the longer contexts it enables improve model quality, including the first Transformers to perform better than chance on the Path-X (16K tokens) and Path-256 (64K tokens) benchmarks.
By Yun Wu