Marvin's Memos

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning



FlashAttention-2 improves upon FlashAttention, an algorithm that speeds up the attention layer in Transformers and reduces its memory usage, a key bottleneck when processing long sequences in natural language processing and other domains. FlashAttention-2 achieves significant speedups over FlashAttention and other baselines through better parallelism and work partitioning: it reduces the number of non-matmul FLOPs, parallelizes the computation along the sequence length dimension, and distributes work more evenly between warps within each thread block on the GPU. The paper presents detailed algorithms for FlashAttention-2's forward and backward passes, along with empirical results on training GPT-style models, reaching up to 225 TFLOPs/s per A100 GPU, or 72% model FLOPs utilization.
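To make the ideas in the episode concrete, here is a minimal NumPy sketch of tiled attention with an online softmax, the core technique FlashAttention introduced and FlashAttention-2 refines. This is only an illustration: the function name, block sizes, and the use of NumPy are assumptions for readability; the real method is a fused CUDA kernel that never materializes the full N x N score matrix and applies further optimizations (fewer non-matmul FLOPs, parallelism over the sequence dimension, and work partitioning between warps) that a NumPy loop cannot show.

```python
import numpy as np

def tiled_attention(Q, K, V, block_q=64, block_k=64):
    """Single-head attention computed block by block with an online softmax.
    Illustrative sketch only (hypothetical helper, not the paper's kernel)."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)

    # FlashAttention-2 parallelizes over query blocks (the sequence length
    # dimension); each outer iteration here corresponds to independent work.
    for qs in range(0, N, block_q):
        qe = min(qs + block_q, N)
        q = Q[qs:qe] * scale
        m = np.full(qe - qs, -np.inf)      # running row maxima
        l = np.zeros(qe - qs)              # running softmax denominators
        acc = np.zeros((qe - qs, d))       # unnormalized output accumulator

        for ks in range(0, N, block_k):
            ke = min(ks + block_k, N)
            s = q @ K[ks:ke].T             # scores for this key/value tile
            m_new = np.maximum(m, s.max(axis=1))
            p = np.exp(s - m_new[:, None])     # numerically stable exponentials
            correction = np.exp(m - m_new)     # rescale previously accumulated state
            l = l * correction + p.sum(axis=1)
            acc = acc * correction[:, None] + p @ V[ks:ke]
            m = m_new

        # Normalize once per query block at the end, rather than rescaling the
        # output inside every inner step (fewer non-matmul FLOPs).
        O[qs:qe] = acc / l[:, None]
    return O

# Usage check: the tiled result should match a reference softmax(QK^T/sqrt(d))V.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 64)) for _ in range(3))
s = Q @ K.T / np.sqrt(64)
ref = (np.exp(s - s.max(1, keepdims=True)) /
       np.exp(s - s.max(1, keepdims=True)).sum(1, keepdims=True)) @ V
print(np.allclose(tiled_attention(Q, K, V), ref))
```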


Marvin's Memos, by Marvin The Paranoid Android