Ref: https://arxiv.org/abs/1901.02860
The paper introduces Transformer-XL, a novel neural architecture for
language modeling that overcomes the limitations of fixed-length
contexts in standard Transformer models. It achieves this through a
segment-level recurrence mechanism and a new relative positional
encoding scheme, enabling the capture of significantly longer-term
dependencies. The resulting model demonstrates state-of-the-art
performance on various language modeling benchmarks, exhibiting
substantial speed improvements during evaluation and the ability to
generate coherent long-form text. The authors present experimental
results and ablation studies validating the effectiveness of their
proposed techniques. They also offer insights into the attention
mechanism.
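
To make the segment-level recurrence idea concrete, here is a minimal
sketch (not the authors' code): hidden states from the previous segment
are cached and reused as extra key/value context when attending within
the current segment. Names such as `d_model`, `seg_len`, and `mem_len`
are illustrative assumptions, and the sketch omits causal masking,
multi-head attention, and the relative positional encoding the paper
pairs with this mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, seg_len, mem_len = 16, 4, 4


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend_with_memory(h_current, memory, W_q, W_k, W_v):
    """Single-head attention whose keys/values span [memory; current segment]."""
    # The cached memory acts as fixed extra context; in a real training
    # setup no gradient would flow back into it.
    context = np.concatenate([memory, h_current], axis=0)   # (mem+seg, d)
    q = h_current @ W_q                                     # (seg, d)
    k = context @ W_k                                       # (mem+seg, d)
    v = context @ W_v                                       # (mem+seg, d)
    scores = q @ k.T / np.sqrt(d_model)                     # (seg, mem+seg)
    return softmax(scores) @ v                              # (seg, d)


W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(3))

memory = np.zeros((mem_len, d_model))      # empty cache before the first segment
for step in range(3):                      # process consecutive segments
    h = rng.standard_normal((seg_len, d_model))
    out = attend_with_memory(h, memory, W_q, W_k, W_v)
    # Cache the newest hidden states so the next segment can attend beyond
    # its own fixed-length window.
    memory = np.concatenate([memory, h], axis=0)[-mem_len:]
    print(step, out.shape)
```

Because the cache carries context across segment boundaries, the
effective context length grows with depth rather than being capped at
the segment size, which is what lets the model capture the longer-term
dependencies described above.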