This episode explores RetrievalAttention, a 2024 paper that aims to make long-context LLM inference far cheaper by retrieving only the most relevant key-value (KV) cache entries during decoding instead of scanning the entire cache at every step. It explains why long-context serving is bottlenecked less by raw FLOPs than by memory traffic and KV-cache growth, citing concrete figures such as roughly 125 GB of KV cache per million tokens for Llama-3-8B and decoding latency that balloons from 32.8 seconds at 128K tokens to 1,765 seconds at 1M. The discussion argues that attention is dynamically sparse in practice, but also emphasizes a key technical caveat: attention lookups use query vectors whose distribution differs from that of the cached keys, so an off-the-shelf vector index built over the keys is not automatically a good proxy for attention, and retrieval-based sparsity has to be designed around that mismatch. Listeners would find it interesting because it connects transformer modeling, systems bottlenecks, and vector retrieval into a practical strategy for making million-token context windows usable in real deployments.
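To make the mechanism concrete, here is a minimal NumPy sketch of decode-time attention that scores only the cached keys most similar to the current query rather than the whole cache. It is an illustration, not the paper's implementation: the cache length, head dimension, and top_k values below are arbitrary, and the brute-force top-k selection stands in for the attention-aware approximate nearest-neighbor index that RetrievalAttention builds over the KV cache on the CPU.

    import numpy as np

    def softmax(x):
        x = x - x.max()          # subtract max for numerical stability
        e = np.exp(x)
        return e / e.sum()

    def full_attention(q, K, V):
        # Exact attention: score every cached key and weight every cached value.
        scores = K @ q / np.sqrt(q.shape[-1])   # (n,)
        return softmax(scores) @ V              # (d,)

    def retrieval_attention(q, K, V, top_k=256):
        # Approximate attention over only the top_k highest-scoring keys,
        # renormalizing the softmax over that subset. In RetrievalAttention the
        # brute-force scan below is replaced by ANN search, so only ~top_k of
        # the n cached entries are ever touched per decoding step.
        scores = K @ q / np.sqrt(q.shape[-1])
        idx = np.argpartition(scores, -top_k)[-top_k:]
        return softmax(scores[idx]) @ V[idx]

    # Toy single-head demo with a synthetic 10K-token cache.
    rng = np.random.default_rng(0)
    n, d = 10_000, 64
    K = rng.standard_normal((n, d)).astype(np.float32)
    V = rng.standard_normal((n, d)).astype(np.float32)
    q = rng.standard_normal(d).astype(np.float32)

    exact = full_attention(q, K, V)
    approx = retrieval_attention(q, K, V)
    print("relative error:", np.linalg.norm(exact - approx) / np.linalg.norm(exact))

With random vectors the attention mass is far less concentrated than in a trained model, so the error printed here overstates what the paper reports; the point of the sketch is only the shape of the computation. The full system described in the episode also keeps a small window of initial and recent tokens on the GPU and merges that exact partial attention with the retrieved portion served from CPU memory.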
Sources:
1. RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval — Di Liu, Meng Chen, Baotong Lu, Huiqiang Jiang, Zhenhua Han, Qianxi Zhang, Qi Chen, Chengruidong Zhang, Bailu Ding, Kai Zhang, Chen Chen, Fan Yang, Yuqing Yang, Lili Qiu, 2024
http://arxiv.org/abs/2409.10516
2. StreamingLLM — Xiao et al., 2024
https://scholar.google.com/scholar?q=StreamingLLM
3. SnapKV — Li et al., 2024
https://scholar.google.com/scholar?q=SnapKV
4. InfLLM — Xiao et al., 2024
https://scholar.google.com/scholar?q=InfLLM
5. Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs — Malkov and Yashunin, 2018
https://scholar.google.com/scholar?q=Efficient+and+robust+approximate+nearest+neighbor+search+using+Hierarchical+Navigable+Small+World+graphs
6. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention — Katharopoulos et al., 2020
https://scholar.google.com/scholar?q=Transformers+are+RNNs:+Fast+Autoregressive+Transformers+with+Linear+Attention
7. H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models — Zhang et al., 2023
https://scholar.google.com/scholar?q=H2O+/+Heavy-Hitter+Oracle+for+Efficient+Generative+Inference+of+Large+Language+Models
8. FlexGen — Sheng et al., 2023
https://scholar.google.com/scholar?q=FlexGen
9. The Needle in a Haystack test / long-context retrieval benchmark family — various, 2023-2024
https://scholar.google.com/scholar?q=The+Needle+in+a+Haystack+test+/+long-context+retrieval+benchmark+family
10. KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long-Context Capable Approaches — authors not identified, 2024/2025
https://scholar.google.com/scholar?q=KV+Cache+Compression,+But+What+Must+We+Give+in+Return?+A+Comprehensive+Benchmark+of+Long-Context+Capable+Approaches
11. RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression — authors not identified, 2024/2025
https://scholar.google.com/scholar?q=RocketKV:+Accelerating+Long-Context+LLM+Inference+via+Two-Stage+KV+Cache+Compression
12. Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks — authors not identified, 2024/2025
https://scholar.google.com/scholar?q=Model+Tells+You+Where+to+Merge:+Adaptive+KV+Cache+Merging+for+LLMs+on+Long-Context+Tasks
13. Efficient Low Rank Attention for Long-Context Inference in Large Language Models (LRQK) — authors not identified, 2024/2025
https://scholar.google.com/scholar?q=Efficient+Low+Rank+Attention+for+Long-Context+Inference+in+Large+Language+Models
14. KVPR: Efficient LLM Inference with I/O-Aware KV Cache Partial Recomputation — authors not identified, 2024/2025
https://scholar.google.com/scholar?q=KVPR:+Efficient+LLM+Inference+with+I/O-Aware+KV+Cache+Partial+Recomputation
15. ScoutAttention: Efficient KV Cache Offloading via Layer-Ahead CPU Pre-computation for LLM Inference — authors not identified, 2024/2025
https://scholar.google.com/scholar?q=ScoutAttention:+Efficient+KV+Cache+Offloading+via+Layer-Ahead+CPU+Pre-computation+for+LLM+Inference
16. DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads — Xiao et al., 2024
https://scholar.google.com/scholar?q=DuoAttention:+Efficient+Long-Context+LLM+Inference+with+Retrieval+and+Streaming+Heads
17. AI Post Transformers: LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-25-lookaheadkv-fast-and-accurate-kv-9cfc9f.mp3
18. AI Post Transformers: Lookahead Q-Cache for Consistent KV Eviction — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-25-lookahead-q-cache-for-consistent-kv-evic-d97b09.mp3
19. AI Post Transformers: KVSwap for Disk-Aware Long-Context On-Device Inference — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-16-kvswap-for-disk-aware-long-context-on-de-f3c15e.mp3
20. AI Post Transformers: SolidAttention: Co-Designing Sparse Attention and SSD I/O — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-18-solidattention-co-designing-sparse-atten-5a8622.mp3
21. AI Post Transformers: TriAttention for Efficient Long-Context KV Compression — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-07-triattention-for-efficient-long-context-6c08ee.mp3
22. AI Post Transformers: Memory Sparse Attention for 100M-Token Scaling — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-07-memory-sparse-attention-for-100m-token-s-377cff.mp3
23. AI Post Transformers: GPU-Accelerated Dynamic Quantized ANNS Graph Search — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-12-gpu-accelerated-dynamic-quantized-anns-g-f2cd4e.mp3
24. AI Post Transformers: TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-25-turboquant-online-vector-quantiz-1967b7.mp3
25. AI Post Transformers: Splitwise: Phase-Split LLM Inference — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-26-splitwise-phase-split-llm-inference-e8945b.mp3
26. AI Post Transformers: FengHuang for Rack-Scale LLM Inference Memory — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-12-fenghuang-for-rack-scale-llm-inference-m-62708e.mp3
27. AI Post Transformers: FlexGen: High-Throughput LLM Inference on a Single GPU — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/flexgen-high-throughput-llm-inference-on-a-single-gpu/
Interactive Visualization: RetrievalAttention for Long-Context LLM Inference