Learning GenAI via SOTA Papers

EP020: Big Bird Scales Transformers With Sparse Attention



The paper "Big Bird: Transformers for Longer Sequences" addresses a key limitation of standard Transformer models such as BERT: their full attention mechanism incurs time and memory costs that grow quadratically with sequence length. To resolve this, the authors propose BIGBIRD, a sparse attention mechanism that reduces computational and memory complexity from quadratic to linear while preserving the model's expressivity.

BIGBIRD achieves this efficiency through a generalized attention mechanism that combines three specific patterns:

Global tokens that attend to the entire sequence.

Local window attention where tokens attend to neighboring tokens.

Random attention where tokens attend to a set of random tokens.
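The three patterns above can be visualized as a single boolean attention mask. The sketch below is an illustrative simplification, not the paper's block-sparse implementation; the function name and parameter values (number of global tokens, window radius, random connections per token) are assumptions chosen for demonstration. The point is that the number of allowed attention pairs grows linearly with sequence length, rather than quadratically as in full attention.

```python
import numpy as np

def bigbird_mask(seq_len, num_global=2, window=3, num_random=3, seed=0):
    """Illustrative sketch of a BigBird-style sparse attention mask.

    Combines three patterns: global tokens, a local sliding window,
    and random connections. Parameter values are assumptions for
    demonstration, not the paper's configuration.
    """
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)

    # 1. Global tokens: the first `num_global` tokens attend to the whole
    #    sequence, and every token attends back to them.
    mask[:num_global, :] = True
    mask[:, :num_global] = True

    # 2. Local window: each token attends to neighbors within `window`
    #    positions on either side (including itself).
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True

    # 3. Random attention: each token attends to `num_random` randomly
    #    chosen tokens.
    for i in range(seq_len):
        mask[i, rng.choice(seq_len, size=num_random, replace=False)] = True

    return mask

mask = bigbird_mask(64)
# Allowed pairs grow linearly in seq_len, far below full attention's n^2.
print(int(mask.sum()), 64 * 64)
```

In the actual paper the patterns are applied in blocks of tokens (block-sparse attention) so that the sparse computation maps efficiently onto GPU/TPU hardware; the per-token version here is only meant to show the structure of the mask.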

Theoretically, the paper proves that BIGBIRD is both a universal approximator of sequence functions and Turing complete, preserving these properties of the full quadratic attention model. Empirically, the model handles sequences up to 8 times longer than previously possible on similar hardware, leading to state-of-the-art (SoTA) performance on Natural Language Processing tasks requiring long contexts, such as question answering and long-document summarization. Additionally, the authors demonstrate a novel application of BIGBIRD in genomics, improving performance on tasks like promoter region prediction by modeling long DNA sequences.


By Yun Wu