This episode explores a paper on Gated Linear Attention Transformers that aims to make long-sequence modeling both higher in quality and genuinely faster on modern GPUs. It explains how GLA replaces standard softmax attention with a gated, recurrent-style memory update that can better decide which information to keep, decay, or overwrite, positioning it between classic linear attention, RetNet-style decay models, and state-space approaches like Mamba. The discussion argues that earlier linear-attention methods often failed twice: they underperformed on model quality, and they lost in wall-clock speed to optimized softmax baselines such as FlashAttention-2. The real test is therefore hardware efficiency, not just better asymptotic complexity. Listeners will find the episode interesting for its clear breakdown of why memory traffic, chunked training, and on-chip SRAM usage may determine whether linear attention becomes a practical alternative for long-context AI systems.
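To make the mechanism concrete, here is a minimal NumPy sketch of the gated recurrence the episode describes: a fixed-size key-value memory that is decayed by a learned gate, overwritten with the current token's key-value outer product, and then read with the query. The shapes, the per-dimension gate `alpha`, and the toy random inputs are illustrative assumptions for this sketch, not the paper's exact parameterization; the paper additionally trains with a chunk-parallel form so most of the work becomes matrix multiplies that stay in on-chip SRAM, whereas this loop is only the inference-style recurrence.

```python
import numpy as np

def gla_recurrent(q, k, v, alpha):
    """Sequential sketch of a gated linear attention layer.

    q, k:   (T, d_k)  queries / keys
    v:      (T, d_v)  values
    alpha:  (T, d_k)  per-step, per-key-dimension forget gates in (0, 1)

    The memory S has fixed size (d_k, d_v), so cost per token is
    O(d_k * d_v) regardless of sequence length.
    """
    T, d_k = q.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))
    out = np.zeros((T, d_v))
    for t in range(T):
        # Decay the old memory with the gate, then write the new key-value pair.
        S = alpha[t][:, None] * S + np.outer(k[t], v[t])
        # Read the memory with the current query.
        out[t] = q[t] @ S
    return out

# Toy usage with random projections standing in for learned ones.
T, d_k, d_v = 16, 8, 8
rng = np.random.default_rng(0)
q = rng.normal(size=(T, d_k))
k = rng.normal(size=(T, d_k))
v = rng.normal(size=(T, d_v))
alpha = 1.0 / (1.0 + np.exp(-rng.normal(size=(T, d_k))))  # sigmoid -> gates in (0, 1)
print(gla_recurrent(q, k, v, alpha).shape)  # (16, 8)
```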
Sources:
1. Gated Linear Attention Transformers with Hardware-Efficient Training — Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, Yoon Kim, 2023
http://arxiv.org/abs/2312.06635
2. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention — Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, François Fleuret, 2020
https://scholar.google.com/scholar?q=Transformers+are+RNNs:+Fast+Autoregressive+Transformers+with+Linear+Attention
3. Rethinking Attention with Performers — Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos and others, 2021
https://scholar.google.com/scholar?q=Rethinking+Attention+with+Performers
4. Retentive Network: A Successor to Transformer for Large Language Models — Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue and others, 2023
https://scholar.google.com/scholar?q=Retentive+Network:+A+Successor+to+Transformer+for+Large+Language+Models
5. Gated Linear Attention Transformers with Hardware-Efficient Training — Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, Yoon Kim, 2024
https://scholar.google.com/scholar?q=Gated+Linear+Attention+Transformers+with+Hardware-Efficient+Training
6. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning — Tri Dao, 2023
https://scholar.google.com/scholar?q=FlashAttention-2:+Faster+Attention+with+Better+Parallelism+and+Work+Partitioning
7. Finetuned Language Models Are Zero-Shot Learners? / A Study of Linear Attention and State Space Models for Language Modeling — Jungo Kasai, Nikolaos Pappas, Hao Peng, James Cross, Noah A. Smith, 2021
https://scholar.google.com/scholar?q=Finetuned+Language+Models+Are+Zero-Shot+Learners?+/+A+Study+of+Linear+Attention+and+State+Space+Models+for+Language+Modeling
8. Mamba: Linear-Time Sequence Modeling with Selective State Spaces — Albert Gu, Tri Dao, 2023
https://scholar.google.com/scholar?q=Mamba:+Linear-Time+Sequence+Modeling+with+Selective+State+Spaces
9. TransNormerLLM: A Faster and Better Large Language Model with Improved TransNormer — Zhen Qin, Dong Li, Weigao Sun, Weixuan Sun, Xuyang Shen and others, 2023
https://scholar.google.com/scholar?q=TransNormerLLM:+A+Faster+and+Better+Large+Language+Model+with+Improved+TransNormer
10. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness — Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré, 2022
https://scholar.google.com/scholar?q=FlashAttention:+Fast+and+Memory-Efficient+Exact+Attention+with+IO-Awareness
11. Blockwise Parallel Transformer for Large Context Models — Hao Liu, Pieter Abbeel, 2023
https://scholar.google.com/scholar?q=Blockwise+Parallel+Transformer+for+Large+Context+Models
12. Were RNNs All We Needed? — Leo Feng, Frederick Tung, Mohamed Osama Ahmed, Yoshua Bengio, Hossein Hajimirsadeghi, 2024
https://scholar.google.com/scholar?q=Were+RNNs+All+We+Needed?
13. On the Expressiveness of Softmax Attention: A Recurrent Neural Network Perspective — approx. recent theory paper; exact authors not recoverable from snippet, 2024/2025
https://scholar.google.com/scholar?q=On+the+Expressiveness+of+Softmax+Attention:+A+Recurrent+Neural+Network+Perspective
14. The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry — approx. recent linear-attention paper; exact authors not recoverable from snippet, 2024/2025
https://scholar.google.com/scholar?q=The+Hedgehog+&+the+Porcupine:+Expressive+Linear+Attentions+with+Softmax+Mimicry
15. Agent Attention: On the Integration of Softmax and Linear Attention — approx. recent attention paper; exact authors not recoverable from snippet, 2024/2025
https://scholar.google.com/scholar?q=Agent+Attention:+On+the+Integration+of+Softmax+and+Linear+Attention
16. Transformer Based Linear Attention with Optimized GPU Kernel Implementation — approx. recent systems paper; exact authors not recoverable from snippet, 2024/2025
https://scholar.google.com/scholar?q=Transformer+Based+Linear+Attention+with+Optimized+GPU+Kernel+Implementation
17. FlexLinearAttention: Compiling a Unified Abstraction into Scalable Kernels for Linear Attention — approx. recent systems/compiler paper; exact authors not recoverable from snippet, 2024/2025
https://scholar.google.com/scholar?q=FlexLinearAttention:+Compiling+a+Unified+Abstraction+into+Scalable+Kernels+for+Linear+Attention
18. PyramidInfer: Pyramid KV Cache Compression for High-Throughput LLM Inference — approx. recent inference paper; exact authors not recoverable from snippet, 2024/2025
https://scholar.google.com/scholar?q=PyramidInfer:+Pyramid+KV+Cache+Compression+for+High-Throughput+LLM+Inference
19. Inference-time Hyper-Scaling with KV Cache Compression — approx. recent inference paper; exact authors not recoverable from snippet, 2024/2025
https://scholar.google.com/scholar?q=Inference-time+Hyper-Scaling+with+KV+Cache+Compression
20. KV Cache Compression for Inference Efficiency in LLMs: A Review — approx. recent review; exact authors not recoverable from snippet, 2024/2025
https://scholar.google.com/scholar?q=KV+Cache+Compression+for+Inference+Efficiency+in+LLMs:+A+Review
21. AI Post Transformers: Mamba-3 for Efficient Sequence Modeling — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-16-mamba-3-for-efficient-sequence-modeling-97a22a.mp3
22. AI Post Transformers: Kimi Linear: Efficient Expressive Attention Architecture — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/kimi-linear-efficient-expressive-attention-architecture/
23. AI Post Transformers: FlatAttention for Tile-Based Accelerator Inference — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-04-flatattention-for-tile-based-accelerator-56e6ca.mp3
24. AI Post Transformers: Memory Sparse Attention for 100M-Token Scaling — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-07-memory-sparse-attention-for-100m-token-s-377cff.mp3
25. AI Post Transformers: KVSwap for Disk-Aware Long-Context On-Device Inference — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-16-kvswap-for-disk-aware-long-context-on-de-f3c15e.mp3
26. AI Post Transformers: TriAttention for Efficient Long-Context KV Compression — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-07-triattention-for-efficient-long-context-6c08ee.mp3
Interactive Visualization: Gated Linear Attention for Efficient Long Sequences