


The paper "Linformer: Self-Attention with Linear Complexity" introduces a new Transformer architecture designed to overcome the efficiency bottleneck of standard self-attention, which scales quadratically (O(n²)) in time and memory with sequence length n.
The authors' core insight is that the self-attention matrix is effectively low-rank: the stochastic matrix formed by the attention weights can be approximated by a much smaller matrix without significant loss of information. Based on this, they propose the Linformer, which applies linear projections to reduce the sequence-length dimension of the Key and Value matrices, lowering self-attention complexity to linear time and space (O(n)).
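The projection idea can be illustrated with a minimal NumPy sketch. This is an assumption-laden simplification, not the paper's implementation: in the actual model the projection matrices E and F are learned parameters, and multi-head structure, masking, and batching are omitted here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def linformer_attention(Q, K, V, E, F):
    """Linformer-style attention: E and F compress the Keys and Values
    along the sequence axis from length n down to k, so the attention
    matrix is (n, k) rather than (n, n)."""
    d = Q.shape[-1]
    K_proj = E @ K                         # (k, d): n keys -> k keys
    V_proj = F @ V                         # (k, d): n values -> k values
    scores = Q @ K_proj.T / np.sqrt(d)     # (n, k) score matrix
    return softmax(scores, axis=-1) @ V_proj  # (n, d) output

# Toy shapes: sequence length n, projected length k << n, head dim d.
rng = np.random.default_rng(0)
n, k, d = 512, 64, 32
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
# Random (not learned) projections, scaled for stability — an assumption.
E, F = (rng.standard_normal((k, n)) / np.sqrt(n) for _ in range(2))

out = linformer_attention(Q, K, V, E, F)
print(out.shape)  # (512, 32)
```

Because the score matrix has shape (n, k) with k fixed, both compute and memory grow linearly in n instead of quadratically.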
Key findings include:
• Performance: The Linformer performs on par with standard Transformer models (like RoBERTa) on both pretraining and downstream tasks such as GLUE and IMDB.
• Efficiency: It offers significant improvements in inference speed and memory consumption, especially for very long sequences, where standard Transformers become prohibitively expensive.
• Theoretical Guarantee: The paper proves that the self-attention matrix can be approximated by a low-rank matrix with small error.
By Yun Wu