This research paper proposes a method for efficiently training linear transformers, neural networks that replace softmax attention with linear attention when processing sequences. Whereas traditional transformers scale quadratically in sequence length, linear transformers scale linearly, making long sequences far cheaper to process. However, existing linear transformers have been shown to struggle on tasks that require long-range dependencies or the ability to retrieve information from a large context. The authors address this limitation by introducing a novel algorithm called DeltaNet, which uses a delta-rule-like update to improve associative recall over long contexts: at each step, the model retrieves the value currently associated with the incoming key and writes back only the difference between the new value and the retrieved one.
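A minimal NumPy sketch of a single delta-rule memory update, as we understand it from the description above (illustrative only; the function name `deltanet_step` and the scalar write strength `beta` are our labels, not the paper's code):

```python
import numpy as np

def deltanet_step(S, k, v, beta):
    """One delta-rule update of the associative memory S (sketch).

    S: (d_v, d_k) memory matrix; k: (d_k,) key; v: (d_v,) value;
    beta: scalar write strength in [0, 1] (assumed learned per token).
    The delta rule reads the value currently stored under k, then
    writes the difference between the new value and the old one.
    """
    v_old = S @ k                           # retrieve current association for k
    S = S + beta * np.outer(v - v_old, k)   # write only the correction
    return S

# toy usage: store an association, then read it back
d = 4
S = np.zeros((d, d))
k = np.eye(d)[0]                 # unit-norm key
v = np.array([1.0, 2.0, 3.0, 4.0])
S = deltanet_step(S, k, v, beta=1.0)
# with beta = 1 and a unit-norm key, S @ k recovers v exactly
```

Because the update overwrites the old association rather than adding on top of it (as vanilla linear attention does), later writes to the same key do not accumulate interference, which is the mechanism behind the improved associative recall.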
DeltaNet is parallelized across the sequence length using a memory-efficient representation of products of Householder matrices, making it practical to train on modern hardware. The authors demonstrate that DeltaNet outperforms other linear-time baselines, particularly on recall-intensive tasks, and that it can be combined with other attention mechanisms to form hybrid models that perform better still.
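To see where the Householder structure comes from, note that each delta-rule step multiplies the memory state by a generalized Householder matrix (I - beta k k^T) before adding the new write; unrolling the recurrence expresses the final state through products of such matrices, which is the structure the parallel algorithm exploits. A toy NumPy check of this equivalence (our illustration under that reading of the paper, not the authors' training kernel):

```python
import numpy as np

# Recurrence: S_t = S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T
rng = np.random.default_rng(0)
T, d = 5, 4
ks = rng.normal(size=(T, d))
ks /= np.linalg.norm(ks, axis=1, keepdims=True)   # unit-norm keys
vs = rng.normal(size=(T, d))
betas = rng.uniform(0.1, 1.0, size=T)

# sequential form
S = np.zeros((d, d))
for k, v, b in zip(ks, vs, betas):
    S = S @ (np.eye(d) - b * np.outer(k, k)) + b * np.outer(v, k)

# unrolled form:
# S_T = sum_t beta_t v_t k_t^T prod_{s>t} (I - beta_s k_s k_s^T)
S_unrolled = np.zeros((d, d))
for t in range(T):
    P = np.eye(d)
    for s in range(t + 1, T):
        P = P @ (np.eye(d) - betas[s] * np.outer(ks[s], ks[s]))
    S_unrolled += betas[t] * np.outer(vs[t], ks[t]) @ P

assert np.allclose(S, S_unrolled)
```

The memory-efficient representation in the paper compactly stores such Householder products so that all time steps can be computed in parallel rather than one by one.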