Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
This paper introduces the Transformer, a neural network architecture for sequence transduction tasks, particularly machine translation, that is based solely on attention mechanisms. The authors argue that the dominant recurrent models are limited by their inherently sequential computation, which prevents parallelization within training examples, and that both recurrent and convolutional models make it harder to learn dependencies between distant positions.
Most Important Ideas/Facts
- The Transformer: This architecture abandons recurrence and convolution entirely, relying instead on multi-head self-attention to draw global dependencies between input and output elements. This allows for significantly more parallelization during training.
- Advantages over RNNs/CNNs:
  - Parallelization: Transformers can process all positions in a sequence in parallel, unlike inherently sequential RNNs, leading to faster training times.
  - Long-Range Dependencies: Self-attention lets the model directly attend to all positions in the sequence, making long-range dependencies easier to learn than in RNNs and CNNs, where the path length between distant positions grows with distance.
- Scaled Dot-Product Attention: The paper introduces this attention function, which computes attention weights from dot products between query and key vectors, scaled by 1/sqrt(d_k). It is faster and more space-efficient than additive attention while maintaining comparable performance (a minimal sketch appears after the link at the end of this summary).
- Multi-Head Attention: This mechanism runs several attention heads in parallel, allowing the model to jointly attend to information from different representation subspaces at different positions, which a single averaged attention head cannot do (see the sketch at the end of this summary).
- Positional Encoding: Since the Transformer has no recurrence or convolution to convey token order, the authors add positional encodings based on sine and cosine functions of different frequencies, injecting information about the relative or absolute position of tokens in the sequence (see the sketch at the end of this summary).
- State-of-the-Art Performance: The Transformer achieves new state-of-the-art BLEU scores on the WMT 2014 English-to-German and English-to-French translation tasks. Notably, it outperforms all previously reported models, including ensembles, on English-to-German, reaching a BLEU score of 28.4.
- Faster Training: The Transformer's parallelizable design yields significantly shorter training times than RNN- or CNN-based models; the authors report about 12 hours for the base model and 3.5 days for the big model on eight P100 GPUs.
- Generalizability: The paper demonstrates the Transformer's generalizability by successfully applying it to English constituency parsing, a task with different challenges than machine translation.

Key quotes:
- "We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely."
- "The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs."
- "In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output."
- "Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this."
- "Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence."

The authors highlight potential future research directions, including:
- Applying the Transformer to tasks involving modalities beyond text (e.g., images, audio, video).
- Exploring local, restricted attention mechanisms for handling large inputs and outputs efficiently.
- Making the generation process less sequential.

https://arxiv.org/abs/1706.03762
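As referenced in the Scaled Dot-Product Attention item above, here is a minimal NumPy sketch of the paper's attention function, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. The function and variable names are illustrative, not taken from any released implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, as defined in the paper."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)            # (seq_q, seq_k) query-key similarities
    if mask is not None:
        scores = np.where(mask, scores, -1e9)                 # e.g. block attention to future positions in the decoder
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)            # softmax over the key dimension
    return weights @ V                                        # weighted sum of value vectors
```

The 1/sqrt(d_k) scaling is the paper's fix for large dot products pushing the softmax into regions with very small gradients.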
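Multi-head attention, as defined in the paper, projects the queries, keys, and values h times with different learned projections, applies attention to each in parallel, concatenates the heads, and applies an output projection: MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O. The sketch below reuses scaled_dot_product_attention from the previous block; the argument layout, weight initialization, and example sizes (the paper's base configuration, d_model = 512, h = 8, d_k = d_v = 64) are assumptions for illustration only.

```python
import numpy as np

def multi_head_self_attention(x, W_q, W_k, W_v, W_o):
    """MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_o, with Q = K = V = x (self-attention).

    W_q, W_k, W_v: lists of per-head projection matrices, each of shape (d_model, d_k);
    W_o: output projection of shape (h * d_v, d_model). Names are illustrative.
    """
    heads = [
        scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)  # each head attends in its own subspace
        for Wq, Wk, Wv in zip(W_q, W_k, W_v)
    ]
    return np.concatenate(heads, axis=-1) @ W_o               # concatenate heads, project back to d_model

# Hypothetical usage with the base configuration: d_model = 512, h = 8, d_k = d_v = 64.
d_model, h, d_k = 512, 8, 64
rng = np.random.default_rng(0)
x = rng.normal(size=(10, d_model))                            # 10 token embeddings (plus positional encodings)
W_q = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_k = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_v = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_o = rng.normal(size=(h * d_k, d_model))
out = multi_head_self_attention(x, W_q, W_k, W_v, W_o)        # shape (10, 512)
```

Because each head works on a reduced dimension d_model / h, the total cost is similar to single-head attention with full dimensionality.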
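The sinusoidal positional encodings are added elementwise to the input embeddings so the model can use token order. This is a direct NumPy sketch of the paper's formulas PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); it assumes d_model is even.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(max_len)[:, None]                  # (max_len, 1) token positions
    two_i = np.arange(0, d_model, 2)[None, :]          # even dimension indices 0, 2, 4, ...
    angles = pos / np.power(10000.0, two_i / d_model)  # wavelengths form a geometric progression
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions use cosine
    return pe                                          # added to the token embeddings
```

The authors chose sinusoids because, for any fixed offset k, PE(pos + k) is a linear function of PE(pos), which they hypothesize makes relative positions easy to attend to, and because the encodings may extrapolate to sequence lengths longer than those seen in training.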