


The paper "Big Bird: Transformers for Longer Sequences" addresses the limitation of standard Transformer models, such as BERT, which suffer from a quadratic dependency on sequence length due to their full attention mechanism. To resolve this, the authors propose BIGBIRD, a sparse attention mechanism that reduces computational and memory complexity from quadratic to linear while preserving the model's expressivity.
BIGBIRD achieves this efficiency through a generalized attention mechanism that combines three specific patterns:
• Global tokens that attend to the entire sequence.
• Local window attention where tokens attend to neighboring tokens.
• Random attention where tokens attend to a set of random tokens.
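The combination of these three patterns can be illustrated with a minimal NumPy sketch that builds a boolean attention mask. This is a simplified illustration, not the paper's actual implementation: the hyperparameter names and values (`num_global`, `window`, `num_random`) are placeholders, and BigBird's block-level computation of sparse attention is omitted.

```python
import numpy as np

def bigbird_mask(seq_len, num_global=2, window=3, num_random=2, seed=0):
    """Boolean mask combining BigBird-style global, window, and random attention.

    mask[i, j] = True means query token i may attend to key token j.
    Hyperparameters here are illustrative, not the paper's exact configuration.
    """
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)

    # 1. Global tokens: the first `num_global` tokens attend to everything,
    #    and every token attends to them.
    mask[:num_global, :] = True
    mask[:, :num_global] = True

    # 2. Local window: each token attends to its neighbors within the window.
    half = window // 2
    for i in range(seq_len):
        lo, hi = max(0, i - half), min(seq_len, i + half + 1)
        mask[i, lo:hi] = True

    # 3. Random attention: each token attends to a few randomly chosen keys.
    for i in range(seq_len):
        mask[i, rng.choice(seq_len, size=num_random, replace=False)] = True

    return mask
```

Because each non-global row has at most `num_global + window + num_random` allowed keys, the number of attended pairs grows linearly with sequence length rather than quadratically, which is the source of BigBird's efficiency.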
Theoretically, the paper proves that BIGBIRD is both a universal approximator of sequence functions and Turing complete, preserving these properties of the full quadratic attention model. Empirically, the model handles sequences up to 8 times longer than previously possible on similar hardware, leading to state-of-the-art (SoTA) performance on Natural Language Processing tasks requiring long contexts, such as question answering and long document summarization. Additionally, the authors demonstrate a novel application of BIGBIRD to genomics, improving performance on tasks such as promoter region prediction by modeling long DNA sequences.
By Yun Wu