AI Post Transformers

Parallelizing DeltaNet Linear Transformers over Sequence Length


This episode explores a June 2024 paper that redesigns DeltaNet-style linear attention so it can be trained efficiently in parallel across sequence length, making it practical at language-model scale rather than just theoretically appealing. It explains how the work builds on the tradeoff between standard softmax attention's strong token-level retrieval and linear attention's compressed, constant-memory state, and argues that the delta rule offers smarter overwrite-and-recall behavior than simple additive memory updates. The discussion highlights why earlier DeltaNet variants were bottlenecked by sequential recurrence and poor GPU utilization, and why solving that systems problem matters for scaling to 1.3B-parameter models trained on 100B tokens. Listeners will appreciate the clear breakdown of how hardware constraints, associative memory, and long-context language modeling intersect, and why this approach aims to outperform strong linear-time baselines and even some transformer setups.
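
To make the memory-update contrast concrete, here is a minimal NumPy sketch (not the paper's chunkwise kernel) of the two recurrences discussed in the episode: the purely additive update of vanilla linear attention versus the delta-rule update used by DeltaNet. The toy dimensions, the per-token write strength beta, and the omission of feature maps and output normalization are all simplifying assumptions for illustration.

    import numpy as np

    def additive_linear_attention(Q, K, V):
        # Vanilla linear attention: S_t = S_{t-1} + v_t k_t^T.
        # Associations only accumulate; nothing stored earlier is ever corrected.
        T, d = Q.shape
        S = np.zeros((d, d))            # constant-size associative memory
        out = np.zeros_like(V)
        for t in range(T):              # sequential recurrence over tokens
            S = S + np.outer(V[t], K[t])
            out[t] = S @ Q[t]
        return out

    def delta_rule_attention(Q, K, V, beta):
        # DeltaNet-style update: S_t = S_{t-1} - beta_t (S_{t-1} k_t - v_t) k_t^T.
        # The memory first recalls its current value for k_t, then overwrites it
        # toward v_t, so stale associations can be erased rather than piled up.
        T, d = Q.shape
        S = np.zeros((d, d))
        out = np.zeros_like(V)
        for t in range(T):              # this token-by-token loop is the training
            pred = S @ K[t]             # bottleneck that the paper targets with a
            S = S - beta[t] * np.outer(pred - V[t], K[t])  # chunkwise parallel form
            out[t] = S @ Q[t]
        return out

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        T, d = 8, 4
        Q, K, V = rng.normal(size=(3, T, d))
        beta = rng.uniform(0.0, 1.0, size=T)   # per-token write strength in [0, 1]
        print(additive_linear_attention(Q, K, V)[-1])
        print(delta_rule_attention(Q, K, V, beta)[-1])

Both loops are deliberately written as sequential recurrences; the paper's contribution is a reparameterization of the delta-rule recurrence (the compact WY representation cited in source 7 is the relevant tool) so that it can be computed chunk by chunk in parallel on GPUs during training.
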
Sources:
1. Parallelizing Linear Transformers with the Delta Rule over Sequence Length — Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, Yoon Kim, 2024
http://arxiv.org/abs/2406.06484
2. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality — Tri Dao, Albert Gu, 2024
https://scholar.google.com/scholar?q=Transformers+are+SSMs:+Generalized+Models+and+Efficient+Algorithms+Through+Structured+State+Space+Duality
3. Gated Linear Attention Transformers with Hardware-Efficient Training — the GLA / Flash Linear Attention authors cited as [124], 2024
https://scholar.google.com/scholar?q=Gated+Linear+Attention+Transformers+with+Hardware-Efficient+Training
4. Were RNNs All We Needed? — the authors cited as [92], 2024
https://scholar.google.com/scholar?q=Were+RNNs+All+We+Needed?
5. Mamba: Linear-Time Sequence Modeling with Selective State Spaces — Albert Gu, Tri Dao, 2024
https://scholar.google.com/scholar?q=Mamba:+Linear-Time+Sequence+Modeling+with+Selective+State+Spaces
6. DeltaNet: A Neural Sequence Model with Fast Weight Programmers — the authors cited as [101] including Imanol Schlag and collaborators, 2024
https://scholar.google.com/scholar?q=DeltaNet:+A+Neural+Sequence+Model+with+Fast+Weight+Programmers
7. The Compact WY Representation for Products of Householder Matrices — the authors cited as [11], 1989
https://scholar.google.com/scholar?q=The+Compact+WY+Representation+for+Products+of+Householder+Matrices
8. FlashAttention — Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré, 2022
https://scholar.google.com/scholar?q=FlashAttention
9. The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry — the authors cited as [6], 2024
https://scholar.google.com/scholar?q=The+Hedgehog+&+the+Porcupine:+Expressive+Linear+Attentions+with+Softmax+Mimicry
10. RazorAttention: Efficient KV Cache Compression Through Retrieval Heads — approx. recent LLM systems/attention-efficiency authors, 2024/2025
https://scholar.google.com/scholar?q=RazorAttention:+Efficient+KV+Cache+Compression+Through+Retrieval+Heads
11. StreamKV: Streaming Video Question-Answering with Segment-Based KV Cache Retrieval and Compression — approx. recent multimodal/streaming inference authors, 2024/2025
https://scholar.google.com/scholar?q=StreamKV:+Streaming+Video+Question-Answering+with+Segment-Based+KV+Cache+Retrieval+and+Compression
12. Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning — approx. recent efficient-inference authors, 2024/2025
https://scholar.google.com/scholar?q=Not+All+Heads+Matter:+A+Head-Level+KV+Cache+Compression+Method+with+Integrated+Retrieval+and+Reasoning
13. Augmenting Language Models with Long-Term Memory — approx. recent long-context/memory-augmented LM authors, 2024/2025
https://scholar.google.com/scholar?q=Augmenting+Language+Models+with+Long-Term+Memory
14. Retrieval Meets Long Context Large Language Models — approx. recent retrieval/long-context evaluation authors, 2024/2025
https://scholar.google.com/scholar?q=Retrieval+Meets+Long+Context+Large+Language+Models
15. Gated Delta Networks: Improving Mamba2 with Delta Rule — approx. recent linear-recurrent/model-architecture authors, 2025
https://scholar.google.com/scholar?q=Gated+Delta+Networks:+Improving+Mamba2+with+Delta+Rule
16. Path Attention: Position Encoding via Accumulating Householder Transformations — approx. recent sequence-modeling authors, 2024/2025
https://scholar.google.com/scholar?q=Path+Attention:+Position+Encoding+via+Accumulating+Householder+Transformations
17. DeltaProduct: Improving State-Tracking in Linear RNNs via Householder Products — approx. recent linear-RNN authors, 2024/2025
https://scholar.google.com/scholar?q=DeltaProduct:+Improving+State-Tracking+in+Linear+RNNs+via+Householder+Products
18. AI Post Transformers: Mamba-3 for Efficient Sequence Modeling — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-16-mamba-3-for-efficient-sequence-modeling-97a22a.mp3
19. AI Post Transformers: Kimi Linear: Efficient Expressive Attention Architecture — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/kimi-linear-efficient-expressive-attention-architecture/
20. AI Post Transformers: RetNet: Retentive Networks: Transformer Successor for Large Language Models — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/retnet-retentive-networks-transformer-successor-for-large-language-models/
21. AI Post Transformers: Ring-linear: Efficient Hybrid Architecture for Long-Context Reasoning — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/ring-linear-efficient-hybrid-architecture-for-long-context-reasoning/
22. AI Post Transformers: Native Sparse Attention: Efficient Long-Context LLMs — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/native-sparse-attention-efficient-long-context-llms/
23. AI Post Transformers: Memory Sparse Attention for 100M-Token Scaling — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-07-memory-sparse-attention-for-100m-token-s-377cff.mp3
24. AI Post Transformers: ALiBi: Attention with Linear Biases Enables Length Extrapolation — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/alibi-attention-with-linear-biases-enables-length-extrapolation/
25. AI Post Transformers: Jet-Nemotron and PostNAS for Faster Long Context — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-24-jet-nemotron-and-postnas-for-faster-long-436381.mp3
26. AI Post Transformers: DRAM-Free In-Flash Computing for LLM Inference — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-09-dram-free-in-flash-computing-for-llm-inf-4ac216.mp3
Interactive Visualization: Parallelizing DeltaNet Linear Transformers over Sequence Length

AI Post Transformers, by mcgrof