The paper introduces Transformer-XL, a novel neural architecture designed to overcome the limitations of fixed-length contexts in standard Transformers, which often lead to context fragmentation and an inability to capture long-term dependencies. To address these issues, the authors propose two key technical innovations: a segment-level recurrence mechanism and a novel relative positional encoding scheme.
The recurrence mechanism enables the model to reuse hidden states from previous segments as an extended context, allowing information to propagate across segments. To prevent temporal confusion when reusing these states, the relative positional encoding replaces absolute positions with dynamic relative distances, which also allows the model to generalize to much longer sequences during evaluation than those seen during training.
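The segment-level recurrence described above can be sketched in a few lines of NumPy. This is a minimal, illustrative single-layer sketch, not the paper's implementation: the function and variable names are my own, and the relative positional encoding, causal masking, and multi-head structure are omitted. The key idea shown is that queries come only from the current segment, while keys and values attend over the cached previous segment plus the current one.

```python
import numpy as np

def attend(q, k, v):
    """Scaled dot-product attention (no masking, for brevity)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def segment_recurrence(segments, d=8, seed=0):
    """Process segments in order, reusing the previous segment's
    hidden states as extended context for keys and values."""
    rng = np.random.default_rng(seed)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
    memory = None  # cached hidden states from the previous segment
    outputs = []
    for seg in segments:  # each seg has shape (segment_len, d)
        # Extended context: cached memory concatenated with the
        # current segment; queries use the current segment only.
        context = seg if memory is None else np.concatenate([memory, seg])
        h = attend(seg @ Wq, context @ Wk, context @ Wv)
        memory = h  # in training this cache is gradient-free (detached)
        outputs.append(h)
    return outputs
```

Because the cache lets information flow from one segment to the next, the effective context grows with depth and segment count rather than being capped at a single fixed-length window.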
Experimental results demonstrate that Transformer-XL:
• Captures dependencies 80% longer than RNNs and 450% longer than vanilla Transformers.
• Is up to 1,800+ times faster than vanilla Transformers during evaluation.
• Achieves state-of-the-art results on five major benchmarks, including WikiText-103, enwik8, and One Billion Word.
• Generates highly coherent, novel text articles consisting of thousands of tokens.
By Yun Wu