This episode explores a systems paper on speeding up retrieval-augmented generation by reusing KV caches for frequently repeated retrieved documents, even when those documents are not exact prompt prefixes. It explains why long RAG prompts make prefill the main latency bottleneck, why standard prefix caching only helps in narrow cases, and why naive non-prefix cache reuse can hurt quality by ignoring cross-chunk attention between the query and retrieved passages. The discussion centers on CacheBlend's core argument: selectively recomputing only the small fraction of tokens in a reused chunk whose KV entries need updated cross-chunk context preserves answer quality while significantly improving time-to-first-token. Listeners would find it interesting for its practical focus on reconciling real-world serving speed with faithful multi-document reasoning, rather than on new model architectures.
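The selective-recompute idea discussed above can be illustrated with a small sketch. This is not the paper's implementation: function names are invented here, and for simplicity the toy assumes fully recomputed KV vectors are available to measure deviation against, whereas the actual system estimates which tokens deviate most without paying for a full prefill. The sketch only shows the core selection-and-blend step.

```python
import numpy as np

def select_tokens_to_recompute(cached_kv, fresh_kv, ratio=0.15):
    """Pick the tokens whose cached KV vectors deviate most from
    freshly computed ones (a toy stand-in for CacheBlend-style
    selection of high-deviation tokens; names are illustrative).

    cached_kv, fresh_kv: arrays of shape (num_tokens, kv_dim).
    ratio: fraction of tokens to recompute.
    """
    # per-token L2 deviation between cached and fresh KV vectors
    deviation = np.linalg.norm(cached_kv - fresh_kv, axis=-1)
    k = max(1, int(ratio * len(deviation)))
    # indices of the k most-deviating tokens, to be recomputed
    return np.argsort(deviation)[-k:]

def blend(cached_kv, fresh_kv, idx):
    """Reuse the cached KV everywhere except the selected tokens,
    which take their freshly recomputed values."""
    blended = cached_kv.copy()
    blended[idx] = fresh_kv[idx]
    return blended
```

The point of the sketch is the asymmetry: only `ratio` of the tokens pay the recompute cost, while the rest reuse cached state, which is where the time-to-first-token savings come from.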
Sources:
1. CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion — Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, Junchen Jiang, 2024
http://arxiv.org/abs/2405.16444
2. Prompt Cache: Modular Attention Reuse for Low-Latency Inference — In Gim, et al., 2024
https://scholar.google.com/scholar?q=Prompt+Cache:+Modular+Attention+Reuse+for+Low-Latency+Inference
3. CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving — Yuhan Liu, et al., 2024
https://scholar.google.com/scholar?q=CacheGen:+KV+Cache+Compression+and+Streaming+for+Fast+Large+Language+Model+Serving
4. RadixAttention for Efficient KV Cache Sharing in LLM Serving — LMSYS / SGLang authors, 2024
https://scholar.google.com/scholar?q=RadixAttention+for+Efficient+KV+Cache+Sharing+in+LLM+Serving
5. vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention — Woosuk Kwon, et al., 2023
https://scholar.google.com/scholar?q=vLLM:+Easy,+Fast,+and+Cheap+LLM+Serving+with+PagedAttention
6. Memorizing Transformers — Yuhuai Wu, et al., 2022
https://scholar.google.com/scholar?q=Memorizing+Transformers
7. FlashAttention — Tri Dao, et al., 2022
https://scholar.google.com/scholar?q=FlashAttention
8. A Survey on Retrieval-Augmented Text Generation — Huayang Li, et al., 2022
https://scholar.google.com/scholar?q=A+Survey+on+Retrieval-Augmented+Text+Generation
9. KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse — approx. recent systems/LLM serving authors, 2024/2025
https://scholar.google.com/scholar?q=Kvlink:+Accelerating+Large+Language+Models+via+Efficient+KV+Cache+Reuse
10. An Experimental Study of KV Cache Reuse Strategies in Chunk-Level Caching Systems — approx. recent systems authors, 2024/2025
https://scholar.google.com/scholar?q=An+Experimental+Study+of+KV+Cache+Reuse+Strategies+in+Chunk-Level+Caching+Systems
11. Efficient Streaming Language Models with Attention Sinks — Guangxuan Xiao, et al., 2023
https://scholar.google.com/scholar?q=Efficient+Streaming+Language+Models+with+Attention+Sinks
12. Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation — approx. survey authors, 2024/2025
https://scholar.google.com/scholar?q=Attention+Sink+in+Transformers:+A+Survey+on+Utilization,+Interpretation,+and+Mitigation
13. Long Context vs. RAG for LLMs: An Evaluation and Revisits — approx. recent RAG evaluation authors, 2024
https://scholar.google.com/scholar?q=Long+Context+vs.+RAG+for+LLMs:+An+Evaluation+and+Revisits
14. Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG — approx. recent RAG authors, 2024
https://scholar.google.com/scholar?q=Long-Context+LLMs+Meet+RAG:+Overcoming+Challenges+for+Long+Inputs+in+RAG
15. KV Cache Offloading for Context-Intensive Tasks — approx. recent systems authors, 2024/2025
https://scholar.google.com/scholar?q=KV+Cache+Offloading+for+Context-Intensive+Tasks
16. KVSwap: Disk-Aware KV Cache Offloading for Long-Context On-Device Inference — approx. recent systems authors, 2024/2025
https://scholar.google.com/scholar?q=KVSwap:+Disk-Aware+KV+Cache+Offloading+for+Long-Context+On-Device+Inference
17. AI Post Transformers: Episode: From Prefix Cache to Fusion RAG Cache: Accelerating LLM Inference in Retrieval-Augmented Generation — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-22-from-prefix-cache-to-fusion-rag-9c5d39.mp3
18. AI Post Transformers: CacheSlide: Position-Aware KV Cache Reuse for Agent LLMs — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-16-cacheslide-position-aware-kv-cache-reuse-cd59c7.mp3
19. AI Post Transformers: Prefill-as-a-Service for Cross-Datacenter KV Cache — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-19-prefill-as-a-service-for-cross-datacente-7560be.mp3
20. AI Post Transformers: KVSwap for Disk-Aware Long-Context On-Device Inference — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-16-kvswap-for-disk-aware-long-context-on-de-f3c15e.mp3
21. AI Post Transformers: FengHuang for Rack-Scale LLM Inference Memory — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-12-fenghuang-for-rack-scale-llm-inference-m-62708e.mp3
22. AI Post Transformers: Speculative Decoding in Real vLLM Serving — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-04-speculative-decoding-in-real-vllm-servin-6f4e2b.mp3
Interactive Visualization: CacheBlend for Fast RAG Serving