This episode explores a paper proposing Memory Sparse Attention, an end-to-end trainable memory architecture designed to scale language models from ordinary long-context settings to 100 million tokens. The discussion explains why standard dense self-attention becomes infeasible at extreme lengths, distinguishes simple context-window extension from true “lifetime-scale” memory, and situates the approach among alternatives like parameter-based memory, recurrent compression, and external retrieval systems such as RAG. It argues that the paper’s core idea is selective, trainable access to a small set of relevant memory segments rather than treating all past tokens as one continuous stream, while also noting the authors’ ambitious systems claims around practical inference. A listener would find it interesting for its clear framing of what makes ultra-long-context modeling hard, and for its skeptical but concrete examination of whether this architecture meaningfully bridges the gap between long prompts and persistent memory.
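The selective-access idea described in the episode can be sketched in a few lines: instead of attending over the entire token history, score stored memory segments by a summary vector, keep the top-k, and attend only within those. This is an illustrative reconstruction under assumptions, not the paper's actual algorithm; all names (`msa_sketch`, `segment_summaries`) are hypothetical.

```python
import numpy as np

def msa_sketch(q, segment_keys, segment_values, segment_summaries, k=2):
    """Hypothetical memory-sparse attention sketch: pick the k most
    relevant memory segments via summary vectors, then run softmax
    attention over only those segments' keys/values."""
    scores = segment_summaries @ q                      # (num_segments,) relevance
    top = np.argsort(scores)[-k:]                       # indices of top-k segments
    K = np.concatenate([segment_keys[i] for i in top])  # (k*seg_len, d)
    V = np.concatenate([segment_values[i] for i in top])
    logits = K @ q / np.sqrt(q.shape[0])                # scaled dot-product scores
    w = np.exp(logits - logits.max())                   # stable softmax
    w /= w.sum()
    return w @ V                                        # attended output, shape (d,)
```

The cost per query is O(k · seg_len) rather than O(total history), which is the basic reason segment selection can scale where dense attention cannot.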
Sources:
1. MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens — Yu Chen, Runkai Chen, Sheng Yi, Xinda Zhao, Xiaohong Li, Jianjin Zhang, Jun Sun, Chuanrui Hu, Yunyun Han, Lidong Bing, Yafeng Deng, Tianqiao Chen, 2026
http://arxiv.org/abs/2603.23516
2. Efficient Memory Management for Large Language Model Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, et al., 2023
https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention
3. Ring Attention with Blockwise Transformers for Near-Infinite Context — Hao Liu, Matei Zaharia, Pieter Abbeel, 2023
https://scholar.google.com/scholar?q=Ring+Attention+with+Blockwise+Transformers+for+Near-Infinite+Context
4. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism — Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro, 2019
https://scholar.google.com/scholar?q=Megatron-LM:+Training+Multi-Billion+Parameter+Language+Models+Using+Model+Parallelism
5. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning — Tri Dao, 2023
https://scholar.google.com/scholar?q=FlashAttention-2:+Faster+Attention+with+Better+Parallelism+and+Work+Partitioning
6. RoFormer: Enhanced Transformer with Rotary Position Embedding — Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, Yunfeng Liu, 2021
https://scholar.google.com/scholar?q=RoFormer:+Enhanced+Transformer+with+Rotary+Position+Embedding
7. Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation — Ofir Press, Noah A. Smith, Mike Lewis, 2021
https://scholar.google.com/scholar?q=Train+Short,+Test+Long:+Attention+with+Linear+Biases+Enables+Input+Length+Extrapolation
8. Extending Context Window of Large Language Models via Positional Interpolation — Shouyuan Chen, Sherman Wong, Liangjian Chen, Yuandong Tian, 2023
https://scholar.google.com/scholar?q=Extending+Context+Window+of+Large+Language+Models+via+Positional+Interpolation
9. YaRN: Efficient Context Window Extension of Large Language Models — Bowen Peng, Jeffrey Quesnelle, Honglu Fan, Enrico Shippole, 2023
https://scholar.google.com/scholar?q=YaRN:+Efficient+Context+Window+Extension+of+Large+Language+Models
10. Titans: Learning to Memorize at Test Time — Ali Behrouz, Peilin Zhong, Vahab Mirrokni, 2025
https://scholar.google.com/scholar?q=Titans:+Learning+to+Memorize+at+Test+Time
11. Infini-attention: Infinite Context for Efficient Transformers — Tsendsuren Munkhdalai, Manaal Faruqui, Siddharth Gopal, 2024
https://scholar.google.com/scholar?q=Infini-attention:+Infinite+Context+for+Efficient+Transformers
12. LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens — Yiran Ding, Li Lyna Zhang, et al., 2024
https://scholar.google.com/scholar?q=LongRoPE:+Extending+LLM+Context+Window+Beyond+2+Million+Tokens
13. MemGPT: Towards LLMs as Operating Systems — Charles Packer, Vivian Fang, Shishir G. Patil, Kevin Lin, Sarah Wooders, Joseph E. Gonzalez, 2024
https://scholar.google.com/scholar?q=MemGPT:+Towards+LLMs+as+Operating+Systems
14. RAG: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks — Patrick Lewis, Ethan Perez, Aleksandra Piktus, et al., 2020
https://scholar.google.com/scholar?q=RAG:+Retrieval-Augmented+Generation+for+Knowledge-Intensive+NLP+Tasks
15. KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache — Zirui Liu, et al., 2024
https://scholar.google.com/scholar?q=KIVI:+A+Tuning-Free+Asymmetric+2bit+Quantization+for+KV+Cache
16. Memorizing Transformers — Yuhuai Wu, Markus N. Rabe, DeLesley Hutchins, Christian Szegedy, 2022
https://scholar.google.com/scholar?q=Memorizing+Transformers
17. TransformerFAM / Focused Attention Memory variants for long-context retrieval — various authors, 2024-2025
https://scholar.google.com/scholar?q=TransformerFAM+/+Focused+Attention+Memory+variants+for+long-context+retrieval
18. OpenRAG: Optimizing RAG End-to-End via In-Context Retrieval Learning — 2025
https://scholar.google.com/scholar?q=OpenRAG:+Optimizing+RAG+End-to-End+via+In-Context+Retrieval+Learning
19. Beyond RAG for Agent Memory: Retrieval by Decoupling and Aggregation — 2025
https://scholar.google.com/scholar?q=Beyond+RAG+for+Agent+Memory:+Retrieval+by+Decoupling+and+Aggregation
20. MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention — Huiqiang Jiang, et al., 2024
https://scholar.google.com/scholar?q=MInference+1.0:+Accelerating+Pre-filling+for+Long-Context+LLMs+via+Dynamic+Sparse+Attention
21. SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention — 2025
https://scholar.google.com/scholar?q=SampleAttention:+Near-Lossless+Acceleration+of+Long+Context+LLM+Inference+with+Adaptive+Structured+Sparse+Attention
22. FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference — 2025
https://scholar.google.com/scholar?q=FlexPrefill:+A+Context-Aware+Sparse+Attention+Mechanism+for+Efficient+Long-Sequence+Inference
23. KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse — 2025
https://scholar.google.com/scholar?q=Kvlink:+Accelerating+Large+Language+Models+via+Efficient+KV+Cache+Reuse
24. HyperRAG: Enhancing Quality-Efficiency Tradeoffs in Retrieval-Augmented Generation with Reranker KV-Cache Reuse — 2025
https://scholar.google.com/scholar?q=HyperRAG:+Enhancing+Quality-Efficiency+Tradeoffs+in+Retrieval-Augmented+Generation+with+Reranker+KV-Cache+Reuse
25. ProphetKV: User-Query-Driven Selective Recomputation for Efficient KV Cache Reuse in Retrieval-Augmented Generation — 2025
https://scholar.google.com/scholar?q=ProphetKV:+User-Query-Driven+Selective+Recomputation+for+Efficient+KV+Cache+Reuse+in+Retrieval-Augmented+Generation
26. Hierarchical Local-Global Transformer With Dynamic Positional Encoding for Document-Level Machine Translation — 2024
https://scholar.google.com/scholar?q=Hierarchical+Local-Global+Transformer+With+Dynamic+Positional+Encoding+for+Document-Level+Machine+Translation
27. AI Post Transformers: MEMSEARCHER: Reinforcement Learning for LLM Memory Management — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-04-memsearcher-reinforcement-learning-for-l-e9ad84.mp3
28. AI Post Transformers: Kimi Linear: Efficient Expressive Attention Architecture — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/kimi-linear-efficient-expressive-attention-architecture/
29. AI Post Transformers: Lookahead Q-Cache for Consistent KV Eviction — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-25-lookahead-q-cache-for-consistent-kv-evic-d97b09.mp3
30. AI Post Transformers: CacheSlide: Position-Aware KV Cache Reuse for Agent LLMs — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-16-cacheslide-position-aware-kv-cache-reuse-cd59c7.mp3
31. AI Post Transformers: Doc-to-LoRA: Internalizing Context as LoRA — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-29-doc-to-lora-internalizing-context-as-lor-8dd5ec.mp3
32. AI Post Transformers: From Prefix Cache to Fusion RAG Cache: Accelerating LLM Inference in Retrieval-Augmented Generation — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-22-from-prefix-cache-to-fusion-rag-9c5d39.mp3
33. AI Post Transformers: TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-25-turboquant-online-vector-quantiz-1967b7.mp3