AI Post Transformers

Gated Delta Networks for Long-Context Retrieval



This episode explores Gated Delta Networks, a sequence-modeling approach that combines Mamba-style gating with DeltaNet-style selective memory updates to improve long-context and retrieval-heavy performance. It explains how linear attention and state-space models compress the past into a fixed recurrent state, why that makes them hardware-efficient, and where they often fail: memory collisions that blur stored associations and weaken retrieval. The discussion argues that gating is useful for broad forgetting while delta updates enable targeted overwrites, making their combination a promising way to preserve retrieval quality without the quadratic costs of standard attention. Listeners would find it interesting for its clear framing of the tradeoff between efficiency and memory fidelity, and for its practical focus on whether these architectures can move beyond elegant theory into GPU-friendly, real-world use.
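
To make the contrast concrete, below is a minimal NumPy sketch of the update rules the episode discusses. This is an illustration under simplifying assumptions, not the papers' exact parameterization: the state S is a small dense matrix, the gates alpha and beta are fixed scalars rather than learned, token-dependent values, and the dimensions are arbitrary toy choices.

import numpy as np

# The state S stores key-value associations as a d_v x d_k matrix;
# reading key k means computing S @ k.

def linear_attention_step(S, k, v):
    # Vanilla linear attention: accumulate a new outer product. Nothing
    # is ever erased, so writes to similar keys collide and blur.
    return S + np.outer(v, k)

def gated_step(S, k, v, alpha):
    # Mamba2-style gating: decay the entire state uniformly (broad
    # forgetting), then write the new association.
    return alpha * S + np.outer(v, k)

def delta_step(S, k, v, beta):
    # DeltaNet-style delta rule: read what is currently stored under k
    # and move only that slot toward the new value (targeted overwrite).
    v_old = S @ k
    return S + beta * np.outer(v - v_old, k)

def gated_delta_step(S, k, v, alpha, beta):
    # Gated delta update: global decay followed by a targeted overwrite.
    # Algebraically: S <- alpha * S (I - beta k k^T) + beta v k^T.
    S = alpha * S
    v_old = S @ k
    return S + beta * np.outer(v - v_old, k)

# Toy run: write two different values under the same unit-norm key.
rng = np.random.default_rng(0)
d_k = d_v = 4
k = rng.normal(size=d_k)
k /= np.linalg.norm(k)
v1, v2 = rng.normal(size=d_v), rng.normal(size=d_v)

S = np.zeros((d_v, d_k))
S = gated_delta_step(S, k, v1, alpha=0.95, beta=1.0)
S = gated_delta_step(S, k, v2, alpha=0.95, beta=1.0)
print(np.allclose(S @ k, v2))  # True: v1 was overwritten, not blurred

With plain linear_attention_step, the same two writes would leave S @ k equal to roughly v1 + v2, a blurred mixture; the gated delta combination recovers v2 exactly, while alpha steadily fades every association that is not explicitly refreshed.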
Sources:
1. Gated Delta Networks: Improving Mamba2 with Delta Rule — Songlin Yang, Jan Kautz, Ali Hatamizadeh, 2024
https://arxiv.org/abs/2412.06464
2. Linear Transformers Are Secretly Fast Weight Programmers — Imanol Schlag, Kazuki Irie, Jürgen Schmidhuber, 2021
https://scholar.google.com/scholar?q=Linear+Transformers+Are+Secretly+Fast+Weight+Programmers
3. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention — Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, François Fleuret, 2020
https://scholar.google.com/scholar?q=Transformers+are+RNNs:+Fast+Autoregressive+Transformers+with+Linear+Attention
4. Parallelizing Linear Transformers with the Delta Rule over Sequence Length — Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, Yoon Kim, 2024
https://scholar.google.com/scholar?q=Parallelizing+Linear+Transformers+with+the+Delta+Rule+over+Sequence+Length
5. Gated Linear Attention Transformers with Hardware-Efficient Training — Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, Yoon Kim, 2024
https://scholar.google.com/scholar?q=Gated+Linear+Attention+Transformers+with+Hardware-Efficient+Training
6. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality — Tri Dao, Albert Gu, 2024
https://scholar.google.com/scholar?q=Transformers+are+SSMs:+Generalized+Models+and+Efficient+Algorithms+Through+Structured+State+Space+Duality
7. Linear Transformers as Associative Memories — Imanol Schlag, Kazuki Irie, Jürgen Schmidhuber, 2021
https://scholar.google.com/scholar?q=Linear+Transformers+as+Associative+Memories
8. Learning to (Learn at Test Time): RNNs with Expressive Hidden States — Yu Sun, Xinhao Li, Karan Dalal, et al., 2024
https://scholar.google.com/scholar?q=Learning+to+(Learn+at+Test+Time):+RNNs+with+Expressive+Hidden+States
9. Adaptive Switching Circuits — Bernard Widrow, Marcian E. Hoff, 1960
https://scholar.google.com/scholar?q=Adaptive+Switching+Circuits+Widrow+Hoff
10. The WY Representation for Products of Householder Matrices — Christian Bischof, Charles Van Loan, 1985
https://scholar.google.com/scholar?q=The+WY+Representation+for+Products+of+Householder+Matrices
11. Hungry Hungry Hippos: Towards Language Modeling with State Space Models — Daniel Y. Fu, Tri Dao, et al., 2022
https://scholar.google.com/scholar?q=Hungry+Hungry+Hippos:+Towards+Language+Modeling+with+State+Space+Models
12. SILA: Enhancing Long-Context Retrieval Capability of Linear Attention via Selective Ignoring — authors unverified, 2025
https://scholar.google.com/scholar?q=SILA:+Enhancing+Long-Context+Retrieval+Capability+of+Linear+Attention+via+Selective+Ignoring
13. Simple linear attention language models balance the recall-throughput tradeoff — Simran Arora, Sabri Eyuboglu, Michael Zhang, et al., 2024
https://scholar.google.com/scholar?q=Simple+linear+attention+language+models+balance+the+recall-throughput+tradeoff
14. A systematic analysis of hybrid linear attention — authors unverified, 2025
https://scholar.google.com/scholar?q=A+systematic+analysis+of+hybrid+linear+attention
15. Understanding transformer from the perspective of associative memory — authors unverified, 2024 or 2025
https://scholar.google.com/scholar?q=Understanding+transformer+from+the+perspective+of+associative+memory
16. Bayesian Optimality of In-Context Learning with Selective State Spaces — authors unverified, 2025
https://scholar.google.com/scholar?q=Bayesian+Optimality+of+In-Context+Learning+with+Selective+State+Spaces
17. Sliding window attention training for efficient large language models (SWAT) — authors unverified, 2025
https://scholar.google.com/scholar?q=Sliding+window+attention+training+for+efficient+large+language+models
18. SWAA: Sliding Window Attention Adaptation for Efficient Long-Context LLMs Without Pretraining — authors unverified, 2025
https://scholar.google.com/scholar?q=SWAA:+Sliding+Window+Attention+Adaptation+for+Efficient+Long-Context+LLMs+Without+Pretraining
19. Short window attention enables long-term memorization — authors unverified, 2025
https://scholar.google.com/scholar?q=Short+window+attention+enables+long-term+memorization
20. AI Post Transformers: Kimi Linear: Efficient Expressive Attention Architecture — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/kimi-linear-efficient-expressive-attention-architecture/
21. AI Post Transformers: Mamba-3 for Efficient Sequence Modeling — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-16-mamba-3-for-efficient-sequence-modeling-97a22a.mp3
22. AI Post Transformers: Jet-Nemotron and PostNAS for Faster Long Context — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-24-jet-nemotron-and-postnas-for-faster-long-436381.mp3
23. AI Post Transformers: Longformer: A Transformer for Long Documents — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/longformer-a-transformer-for-long-documents/
24. AI Post Transformers: Optimizing Mixture of Block Attention Through Statistical Theory — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-18-optimizing-mixture-of-block-attention-th-214f91.mp3
Interactive Visualization: Gated Delta Networks for Long-Context Retrieval

AI Post Transformers, by mcgrof