AI Post Transformers

Parallelizing DeltaNet Linear Transformers over Sequence Length


This episode explores a June 2024 paper that redesigns DeltaNet-style linear attention so it can be trained efficiently in parallel across sequence length, making it practical at language-model scale rather than just theoretically appealing. It explains how the work builds on the tradeoff between standard softmax attention's strong token-level retrieval and linear attention's compressed, constant-memory state, and argues that the delta rule offers smarter overwrite-and-recall behavior than simple additive memory updates. The discussion highlights why earlier DeltaNet variants were bottlenecked by sequential recurrence and poor GPU utilization, and why solving that systems problem matters for scaling to 1.3B-parameter models trained on 100B tokens. Listeners will appreciate the clear breakdown of how hardware constraints, associative memory, and long-context language modeling intersect, and why this approach aims to outperform strong linear-time baselines and even some transformer setups.
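
To make the memory-update contrast concrete, here is a minimal NumPy sketch (not the paper's chunkwise kernel) of the two recurrences discussed in the episode: the purely additive update of vanilla linear attention versus the delta-rule update used by DeltaNet. The toy dimensions, the per-token write strength beta, and the omission of feature maps and output normalization are all simplifying assumptions for illustration.

    import numpy as np

    def additive_linear_attention(Q, K, V):
        # Vanilla linear attention: S_t = S_{t-1} + v_t k_t^T.
        # Associations only accumulate; nothing stored earlier is ever corrected.
        T, d = Q.shape
        S = np.zeros((d, d))            # constant-size associative memory
        out = np.zeros_like(V)
        for t in range(T):              # sequential recurrence over tokens
            S = S + np.outer(V[t], K[t])
            out[t] = S @ Q[t]
        return out

    def delta_rule_attention(Q, K, V, beta):
        # DeltaNet-style update: S_t = S_{t-1} - beta_t (S_{t-1} k_t - v_t) k_t^T.
        # The memory first recalls its current value for k_t, then overwrites it
        # toward v_t, so stale associations can be erased rather than piled up.
        T, d = Q.shape
        S = np.zeros((d, d))
        out = np.zeros_like(V)
        for t in range(T):              # this token-by-token loop is the training
            pred = S @ K[t]             # bottleneck that the paper targets with a
            S = S - beta[t] * np.outer(pred - V[t], K[t])  # chunkwise parallel form
            out[t] = S @ Q[t]
        return out

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        T, d = 8, 4
        Q, K, V = rng.normal(size=(3, T, d))
        beta = rng.uniform(0.0, 1.0, size=T)   # per-token write strength in [0, 1]
        print(additive_linear_attention(Q, K, V)[-1])
        print(delta_rule_attention(Q, K, V, beta)[-1])

Both loops are deliberately written as sequential recurrences; the paper's contribution is a reparameterization of the delta-rule recurrence (the compact WY representation cited in source 7 is the relevant tool) so that it can be computed chunk by chunk in parallel on GPUs during training.
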
Sources:
1. Parallelizing Linear Transformers with the Delta Rule over Sequence Length — Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, Yoon Kim, 2024
http://arxiv.org/abs/2406.06484
2. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality — Tri Dao, Albert Gu, 2024
https://scholar.google.com/scholar?q=Transformers+are+SSMs:+Generalized+Models+and+Efficient+Algorithms+Through+Structured+State+Space+Duality
3. Gated Linear Attention Transformers with Hardware-Efficient Training — the GLA / Flash Linear Attention authors cited as [124], 2024
https://scholar.google.com/scholar?q=Gated+Linear+Attention+Transformers+with+Hardware-Efficient+Training
4. Were RNNs All We Needed? — the authors cited as [92], 2024
https://scholar.google.com/scholar?q=Were+RNNs+All+We+Needed?
5. Mamba: Linear-Time Sequence Modeling with Selective State Spaces — Albert Gu, Tri Dao, 2024
https://scholar.google.com/scholar?q=Mamba:+Linear-Time+Sequence+Modeling+with+Selective+State+Spaces
6. DeltaNet: A Neural Sequence Model with Fast Weight Programmers — the authors cited as [101] including Imanol Schlag and collaborators, 2024
https://scholar.google.com/scholar?q=DeltaNet:+A+Neural+Sequence+Model+with+Fast+Weight+Programmers
7. The Compact WY Representation for Products of Householder Matrices — the authors cited as [11], 1989
https://scholar.google.com/scholar?q=The+Compact+WY+Representation+for+Products+of+Householder+Matrices
8. FlashAttention — Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré, 2022
https://scholar.google.com/scholar?q=FlashAttention
9. The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry — the authors cited as [6], 2024
https://scholar.google.com/scholar?q=The+Hedgehog+&+the+Porcupine:+Expressive+Linear+Attentions+with+Softmax+Mimicry
10. RazorAttention: Efficient KV Cache Compression Through Retrieval Heads — approx. recent LLM systems/attention-efficiency authors, 2024/2025
https://scholar.google.com/scholar?q=RazorAttention:+Efficient+KV+Cache+Compression+Through+Retrieval+Heads
11. StreamKV: Streaming Video Question-Answering with Segment-Based KV Cache Retrieval and Compression — approx. recent multimodal/streaming inference authors, 2024/2025
https://scholar.google.com/scholar?q=StreamKV:+Streaming+Video+Question-Answering+with+Segment-Based+KV+Cache+Retrieval+and+Compression
12. Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning — approx. recent efficient-inference authors, 2024/2025
https://scholar.google.com/scholar?q=Not+All+Heads+Matter:+A+Head-Level+KV+Cache+Compression+Method+with+Integrated+Retrieval+and+Reasoning
13. Augmenting Language Models with Long-Term Memory — approx. recent long-context/memory-augmented LM authors, 2024/2025
https://scholar.google.com/scholar?q=Augmenting+Language+Models+with+Long-Term+Memory
14. Retrieval Meets Long Context Large Language Models — approx. recent retrieval/long-context evaluation authors, 2024/2025
https://scholar.google.com/scholar?q=Retrieval+Meets+Long+Context+Large+Language+Models
15. Gated Delta Networks: Improving Mamba2 with Delta Rule — approx. recent linear-recurrent/model-architecture authors, 2025
https://scholar.google.com/scholar?q=Gated+Delta+Networks:+Improving+Mamba2+with+Delta+Rule
16. Path Attention: Position Encoding via Accumulating Householder Transformations — approx. recent sequence-modeling authors, 2024/2025
https://scholar.google.com/scholar?q=Path+Attention:+Position+Encoding+via+Accumulating+Householder+Transformations
17. DeltaProduct: Improving State-Tracking in Linear RNNs via Householder Products — approx. recent linear-RNN authors, 2024/2025
https://scholar.google.com/scholar?q=DeltaProduct:+Improving+State-Tracking+in+Linear+RNNs+via+Householder+Products
18. AI Post Transformers: Mamba-3 for Efficient Sequence Modeling — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-16-mamba-3-for-efficient-sequence-modeling-97a22a.mp3
19. AI Post Transformers: Kimi Linear: Efficient Expressive Attention Architecture — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/kimi-linear-efficient-expressive-attention-architecture/
20. AI Post Transformers: RetNet: Retentive Networks: Transformer Successor for Large Language Models — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/retnet-retentive-networks-transformer-successor-for-large-language-models/
21. AI Post Transformers: Ring-linear: Efficient Hybrid Architecture for Long-Context Reasoning — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/ring-linear-efficient-hybrid-architecture-for-long-context-reasoning/
22. AI Post Transformers: Native Sparse Attention: Efficient Long-Context LLMs — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/native-sparse-attention-efficient-long-context-llms/
23. AI Post Transformers: Memory Sparse Attention for 100M-Token Scaling — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-07-memory-sparse-attention-for-100m-token-s-377cff.mp3
24. AI Post Transformers: ALiBi: Attention with Linear Biases Enables Length Extrapolation — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/alibi-attention-with-linear-biases-enables-length-extrapolation/
25. AI Post Transformers: Jet-Nemotron and PostNAS for Faster Long Context — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-24-jet-nemotron-and-postnas-for-faster-long-436381.mp3
26. AI Post Transformers: DRAM-Free In-Flash Computing for LLM Inference — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-09-dram-free-in-flash-computing-for-llm-inf-4ac216.mp3
Interactive Visualization: Parallelizing DeltaNet Linear Transformers over Sequence Length

AI Post Transformers, by mcgrof