Ref: https://arxiv.org/abs/2004.05150
The paper introduces Longformer, a Transformer model designed to
efficiently process long sequences. It addresses the quadratic
complexity of standard self-attention with an attention mechanism that
scales linearly in sequence length, combining local windowed attention
with task-motivated global attention.
The authors demonstrate Longformer's effectiveness on character-level
language modeling, achieving state-of-the-art results on text8 and
enwik8, and show that a pretrained Longformer consistently outperforms
RoBERTa on long-document downstream tasks such as WikiHop and TriviaQA.
They also introduce the Longformer-Encoder-Decoder (LED), a variant for
long-document sequence-to-sequence tasks, and demonstrate its
effectiveness on arXiv summarization. The efficiency gains come from
the sparse attention pattern itself, while strong performance relies on
training choices such as staged training that gradually increases the
window size and sequence length for language modeling, and continued
MLM pretraining from a RoBERTa checkpoint for downstream tasks.
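Since the summary hinges on how the sparse attention pattern yields linear
scaling, a minimal sketch may help. The snippet below is illustrative only,
not the authors' implementation: the function name, window size, and choice
of global positions are assumptions. It builds the combined local-window plus
global pattern as a boolean mask and counts attended pairs to show the
roughly linear growth compared with full n^2 attention.

```python
# A minimal sketch (not the paper's implementation) of the Longformer
# attention pattern: each token attends to a fixed local window of
# 2*w + 1 positions, and a few designated "global" tokens attend to,
# and are attended by, every position. The dense boolean mask here is
# only for illustration; a real implementation computes the banded part
# directly so memory and time grow linearly with sequence length.
import numpy as np


def longformer_attention_mask(seq_len, window, global_positions=()):
    """Boolean mask where mask[i, j] is True if query i may attend to key j."""
    idx = np.arange(seq_len)
    # Local sliding window: attend when |i - j| <= window.
    mask = np.abs(idx[:, None] - idx[None, :]) <= window
    # Global attention is symmetric: global tokens see everything,
    # and every token sees the global tokens.
    for g in global_positions:
        mask[g, :] = True
        mask[:, g] = True
    return mask


if __name__ == "__main__":
    w = 2  # one-sided window size (the paper uses much larger windows, e.g. 512)
    for n in (64, 128, 256):
        # Assume a single [CLS]-like global token at position 0.
        m = longformer_attention_mask(n, w, global_positions=(0,))
        # Attended pairs grow roughly as n * (2w + 1) plus O(n) for the
        # global token, i.e. linearly in n, versus n**2 for full attention.
        print(f"n={n:4d}  sparse pairs={int(m.sum()):6d}  full pairs={n * n:6d}")
```

In the paper itself the banded (windowed) portion is computed with custom
kernels and chunked matrix multiplications rather than by materializing an
n-by-n mask, which is what makes the linear scaling practical.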