AI Post Transformers

Gated Delta Networks for Long-Context Retrieval



This episode explores Gated Delta Networks, a sequence-modeling approach that combines Mamba-style gating with DeltaNet-style selective memory updates to improve long-context and retrieval-heavy performance. It explains how linear attention and state-space models compress the past into a fixed recurrent state, why that makes them hardware-efficient, and where they often fail: memory collisions that blur stored associations and weaken retrieval. The discussion argues that gating is useful for broad forgetting while delta updates enable targeted overwrites, making their combination a promising way to preserve retrieval quality without the quadratic costs of standard attention. Listeners would find it interesting for its clear framing of the tradeoff between efficiency and memory fidelity, and for its practical focus on whether these architectures can move beyond elegant theory into GPU-friendly, real-world use.
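
To make the contrast concrete, below is a minimal NumPy sketch of the update rules the episode discusses. This is an illustration under simplifying assumptions, not the papers' exact parameterization: the state S is a small dense matrix, the gates alpha and beta are fixed scalars rather than learned, token-dependent values, and the dimensions are arbitrary toy choices.

import numpy as np

# The state S stores key-value associations as a d_v x d_k matrix;
# reading key k means computing S @ k.

def linear_attention_step(S, k, v):
    # Vanilla linear attention: accumulate a new outer product. Nothing
    # is ever erased, so writes to similar keys collide and blur.
    return S + np.outer(v, k)

def gated_step(S, k, v, alpha):
    # Mamba2-style gating: decay the entire state uniformly (broad
    # forgetting), then write the new association.
    return alpha * S + np.outer(v, k)

def delta_step(S, k, v, beta):
    # DeltaNet-style delta rule: read what is currently stored under k
    # and move only that slot toward the new value (targeted overwrite).
    v_old = S @ k
    return S + beta * np.outer(v - v_old, k)

def gated_delta_step(S, k, v, alpha, beta):
    # Gated delta update: global decay followed by a targeted overwrite.
    # Algebraically: S <- alpha * S (I - beta k k^T) + beta v k^T.
    S = alpha * S
    v_old = S @ k
    return S + beta * np.outer(v - v_old, k)

# Toy run: write two different values under the same unit-norm key.
rng = np.random.default_rng(0)
d_k = d_v = 4
k = rng.normal(size=d_k)
k /= np.linalg.norm(k)
v1, v2 = rng.normal(size=d_v), rng.normal(size=d_v)

S = np.zeros((d_v, d_k))
S = gated_delta_step(S, k, v1, alpha=0.95, beta=1.0)
S = gated_delta_step(S, k, v2, alpha=0.95, beta=1.0)
print(np.allclose(S @ k, v2))  # True: v1 was overwritten, not blurred

With plain linear_attention_step, the same two writes would leave S @ k equal to roughly v1 + v2, a blurred mixture; the gated delta combination recovers v2 exactly, while alpha steadily fades every association that is not explicitly refreshed.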
Sources:
1. Gated Delta Networks: Improving Mamba2 with Delta Rule — Songlin Yang, Jan Kautz, Ali Hatamizadeh, 2024
https://arxiv.org/abs/2412.06464
2. Linear Transformers Are Secretly Fast Weight Programmers — Imanol Schlag, Kazuki Irie, Jürgen Schmidhuber, 2021
https://scholar.google.com/scholar?q=Linear+Transformers+Are+Secretly+Fast+Weight+Programmers
3. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention — Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, François Fleuret, 2020
https://scholar.google.com/scholar?q=Transformers+are+RNNs:+Fast+Autoregressive+Transformers+with+Linear+Attention
4. Parallelizing Linear Transformers with the Delta Rule over Sequence Length — Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, Yoon Kim, 2024
https://scholar.google.com/scholar?q=Parallelizing+Linear+Transformers+with+the+Delta+Rule+over+Sequence+Length
5. Gated Linear Attention Transformers with Hardware-Efficient Training — Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, Yoon Kim, 2024
https://scholar.google.com/scholar?q=Gated+Linear+Attention+Transformers+with+Hardware-Efficient+Training
6. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality — Tri Dao, Albert Gu, 2024
https://scholar.google.com/scholar?q=Transformers+are+SSMs:+Generalized+Models+and+Efficient+Algorithms+Through+Structured+State+Space+Duality
7. Linear Transformers as Associative Memories — Imanol Schlag, Kazuki Irie, Jürgen Schmidhuber, 2021
https://scholar.google.com/scholar?q=Linear+Transformers+as+Associative+Memories
8. Learning to (Learn at Test Time): RNNs with Expressive Hidden States — Yu Sun, Xinhao Li, Karan Dalal, et al., 2024
https://scholar.google.com/scholar?q=Learning+to+(Learn+at+Test+Time):+RNNs+with+Expressive+Hidden+States
9. Adaptive Switching Circuits — Bernard Widrow, Marcian E. Hoff, 1960
https://scholar.google.com/scholar?q=Adaptive+Switching+Circuits+Widrow+Hoff
10. The WY Representation for Products of Householder Matrices — Christian Bischof, Charles Van Loan, 1985
https://scholar.google.com/scholar?q=The+WY+Representation+for+Products+of+Householder+Matrices
11. Hungry Hungry Hippos: Towards Language Modeling with State Space Models — Daniel Y. Fu, Tri Dao, et al., 2022
https://scholar.google.com/scholar?q=Hungry+Hungry+Hippos:+Towards+Language+Modeling+with+State+Space+Models
12. SILA: Enhancing Long-Context Retrieval Capability of Linear Attention via Selective Ignoring — authors unverified, 2025
https://scholar.google.com/scholar?q=SILA:+Enhancing+Long-Context+Retrieval+Capability+of+Linear+Attention+via+Selective+Ignoring
13. Simple linear attention language models balance the recall-throughput tradeoff — Simran Arora, Sabri Eyuboglu, Michael Zhang, et al., 2024
https://scholar.google.com/scholar?q=Simple+linear+attention+language+models+balance+the+recall-throughput+tradeoff
14. A systematic analysis of hybrid linear attention — authors unverified, 2025
https://scholar.google.com/scholar?q=A+systematic+analysis+of+hybrid+linear+attention
15. Understanding transformer from the perspective of associative memory — authors unverified, 2024 or 2025
https://scholar.google.com/scholar?q=Understanding+transformer+from+the+perspective+of+associative+memory
16. Bayesian Optimality of In-Context Learning with Selective State Spaces — authors unverified, 2025
https://scholar.google.com/scholar?q=Bayesian+Optimality+of+In-Context+Learning+with+Selective+State+Spaces
17. Sliding window attention training for efficient large language models (SWAT) — authors unverified, 2025
https://scholar.google.com/scholar?q=Sliding+window+attention+training+for+efficient+large+language+models
18. SWAA: Sliding Window Attention Adaptation for Efficient Long-Context LLMs Without Pretraining — authors unverified, 2025
https://scholar.google.com/scholar?q=SWAA:+Sliding+Window+Attention+Adaptation+for+Efficient+Long-Context+LLMs+Without+Pretraining
19. Short window attention enables long-term memorization — authors unverified, 2025
https://scholar.google.com/scholar?q=Short+window+attention+enables+long-term+memorization
20. AI Post Transformers: Kimi Linear: Efficient Expressive Attention Architecture — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/kimi-linear-efficient-expressive-attention-architecture/
21. AI Post Transformers: Mamba-3 for Efficient Sequence Modeling — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-16-mamba-3-for-efficient-sequence-modeling-97a22a.mp3
22. AI Post Transformers: Jet-Nemotron and PostNAS for Faster Long Context — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-24-jet-nemotron-and-postnas-for-faster-long-436381.mp3
23. AI Post Transformers: Longformer: A Transformer for Long Documents — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/longformer-a-transformer-for-long-documents/
24. AI Post Transformers: Optimizing Mixture of Block Attention Through Statistical Theory — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-18-optimizing-mixture-of-block-attention-th-214f91.mp3
Interactive Visualization: Gated Delta Networks for Long-Context Retrieval

AI Post Transformers, by mcgrof