The paper "Longformer: The Long-Document Transformer" addresses the computational inefficiency of standard Transformer models, whose self-attention scales quadratically with sequence length, making long documents prohibitively expensive to process. To resolve this, the authors introduce a modified attention mechanism that scales linearly with sequence length, combining a local sliding window attention with a task-specific global attention.
This architecture allows the model to process sequences of thousands of tokens—typically up to 4,096, though up to 32,000 in specific experiments—while maintaining the ability to capture both local context and long-range dependencies. Longformer acts as a drop-in replacement for the self-attention mechanism in pretrained models like BERT and RoBERTa, consistently outperforming RoBERTa on long-document tasks such as question answering, coreference resolution, and classification.
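The combination of a sliding window with a few global tokens can be illustrated with a minimal sketch. The dense boolean mask below is for illustration only (the paper's actual implementation relies on custom CUDA kernels to realize the linear memory cost); the function name, window size, and choice of token 0 as the global token are assumptions for the example.

```python
import numpy as np

def longformer_attention_mask(seq_len, window, global_idx=()):
    """Boolean mask: True where token i may attend to token j.

    Combines a local sliding window (each token attends to `window`
    neighbors on each side) with symmetric global attention for the
    tokens in `global_idx` (e.g. a [CLS]-style token for classification).
    """
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    mask = np.abs(i - j) <= window   # local sliding-window attention
    for g in global_idx:
        mask[g, :] = True            # global token attends to all positions
        mask[:, g] = True            # all positions attend to the global token
    return mask

# Token 0 is treated as a global token, as in classification setups.
mask = longformer_attention_mask(seq_len=4096, window=256, global_idx=[0])

# Attended pairs grow as O(seq_len * window), versus O(seq_len**2)
# for full self-attention.
print(mask.sum(), "of", 4096 * 4096, "pairs attended")
```

Because only the banded window and a handful of global rows/columns are nonzero, the attended-pair count grows linearly in sequence length, which is what lets the model reach inputs of thousands of tokens.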
Key achievements highlighted in the paper include:
• State-of-the-Art Results: The model set new benchmarks for character-level language modeling (text8 and enwik8) and specific downstream tasks like WikiHop and TriviaQA.
• Longformer-Encoder-Decoder (LED): The authors also introduced a variant called LED designed for generative sequence-to-sequence tasks, which proved effective for long document summarization on the arXiv dataset.
By Yun Wu