


The paper "Big Bird: Transformers for Longer Sequences" addresses the limitation of standard Transformer models, such as BERT, which suffer from a quadratic dependency on sequence length due to their full attention mechanism. To resolve this, the authors propose BIGBIRD, a sparse attention mechanism that reduces computational and memory complexity from quadratic to linear while preserving the model's expressivity.
BIGBIRD achieves this efficiency through a generalized attention mechanism that combines three specific patterns:
• Global tokens that attend to the entire sequence.
• Local window attention where tokens attend to neighboring tokens.
• Random attention where tokens attend to a set of random tokens.
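The combination of these three patterns can be illustrated with a minimal NumPy sketch that builds a boolean attention mask. This is a simplified illustration, not the paper's actual implementation: the hyperparameter names and values (`num_global`, `window`, `num_random`) are placeholders, and BigBird's block-level computation of sparse attention is omitted.

```python
import numpy as np

def bigbird_mask(seq_len, num_global=2, window=3, num_random=2, seed=0):
    """Boolean mask combining BigBird-style global, window, and random attention.

    mask[i, j] = True means query token i may attend to key token j.
    Hyperparameters here are illustrative, not the paper's exact configuration.
    """
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)

    # 1. Global tokens: the first `num_global` tokens attend to everything,
    #    and every token attends to them.
    mask[:num_global, :] = True
    mask[:, :num_global] = True

    # 2. Local window: each token attends to its neighbors within the window.
    half = window // 2
    for i in range(seq_len):
        lo, hi = max(0, i - half), min(seq_len, i + half + 1)
        mask[i, lo:hi] = True

    # 3. Random attention: each token attends to a few randomly chosen keys.
    for i in range(seq_len):
        mask[i, rng.choice(seq_len, size=num_random, replace=False)] = True

    return mask
```

Because each non-global row has at most `num_global + window + num_random` allowed keys, the number of attended pairs grows linearly with sequence length rather than quadratically, which is the source of BigBird's efficiency.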
Theoretically, the paper proves that BIGBIRD is both a universal approximator of sequence functions and Turing complete, preserving these properties of the full quadratic attention model. Empirically, the model handles sequences up to 8 times longer than previously possible on similar hardware, leading to state-of-the-art (SoTA) performance on Natural Language Processing tasks requiring long contexts, such as question answering and long document summarization. Additionally, the authors demonstrate a novel application of BIGBIRD to genomics, improving performance on tasks such as promoter region prediction by modeling long DNA sequences.
By Yun Wu