This episode explores a systems paper on speeding up retrieval-augmented generation by reusing KV caches for frequently repeated retrieved documents, even when those documents are not exact prompt prefixes. It explains why long RAG prompts make prefill the main latency bottleneck, why standard prefix caching only helps in narrow cases, and why naive non-prefix cache reuse can hurt quality by ignoring cross-chunk attention between the query and retrieved passages. The discussion centers on CacheBlend's core argument: selectively recomputing only the small fraction of tokens in a reused chunk whose KV entries need updated cross-chunk context preserves answer quality while significantly improving time-to-first-token. Listeners would find it interesting for its practical focus on reconciling real-world serving speed with faithful multi-document reasoning, rather than on new model architectures.
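The selective-recompute idea discussed above can be illustrated with a small sketch. This is not the paper's implementation: function names are invented here, and for simplicity the toy assumes fully recomputed KV vectors are available to measure deviation against, whereas the actual system estimates which tokens deviate most without paying for a full prefill. The sketch only shows the core selection-and-blend step.

```python
import numpy as np

def select_tokens_to_recompute(cached_kv, fresh_kv, ratio=0.15):
    """Pick the tokens whose cached KV vectors deviate most from
    freshly computed ones (a toy stand-in for CacheBlend-style
    selection of high-deviation tokens; names are illustrative).

    cached_kv, fresh_kv: arrays of shape (num_tokens, kv_dim).
    ratio: fraction of tokens to recompute.
    """
    # per-token L2 deviation between cached and fresh KV vectors
    deviation = np.linalg.norm(cached_kv - fresh_kv, axis=-1)
    k = max(1, int(ratio * len(deviation)))
    # indices of the k most-deviating tokens, to be recomputed
    return np.argsort(deviation)[-k:]

def blend(cached_kv, fresh_kv, idx):
    """Reuse the cached KV everywhere except the selected tokens,
    which take their freshly recomputed values."""
    blended = cached_kv.copy()
    blended[idx] = fresh_kv[idx]
    return blended
```

The point of the sketch is the asymmetry: only `ratio` of the tokens pay the recompute cost, while the rest reuse cached state, which is where the time-to-first-token savings come from.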
Sources:
1. CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion — Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, Junchen Jiang, 2024
http://arxiv.org/abs/2405.16444
2. Prompt Cache: Modular Attention Reuse for Low-Latency Inference — In Gim, et al., 2024
https://scholar.google.com/scholar?q=Prompt+Cache:+Modular+Attention+Reuse+for+Low-Latency+Inference
3. CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving — Yuhan Liu, et al., 2024
https://scholar.google.com/scholar?q=CacheGen:+KV+Cache+Compression+and+Streaming+for+Fast+Large+Language+Model+Serving
4. RadixAttention for Efficient KV Cache Sharing in LLM Serving — LMSYS / SGLang authors, 2024
https://scholar.google.com/scholar?q=RadixAttention+for+Efficient+KV+Cache+Sharing+in+LLM+Serving
5. vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention — Woosuk Kwon, et al., 2023
https://scholar.google.com/scholar?q=vLLM:+Easy,+Fast,+and+Cheap+LLM+Serving+with+PagedAttention
6. Memorizing Transformers — Yuhuai Wu, et al., 2022
https://scholar.google.com/scholar?q=Memorizing+Transformers
7. FlashAttention — Tri Dao, et al., 2022
https://scholar.google.com/scholar?q=FlashAttention
8. A Survey on Retrieval-Augmented Text Generation — Huayang Li, et al., 2022
https://scholar.google.com/scholar?q=A+Survey+on+Retrieval-Augmented+Text+Generation
9. KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse — approx. recent systems/LLM serving authors, 2024/2025
https://scholar.google.com/scholar?q=Kvlink:+Accelerating+Large+Language+Models+via+Efficient+KV+Cache+Reuse
10. An Experimental Study of KV Cache Reuse Strategies in Chunk-Level Caching Systems — approx. recent systems authors, 2024/2025
https://scholar.google.com/scholar?q=An+Experimental+Study+of+KV+Cache+Reuse+Strategies+in+Chunk-Level+Caching+Systems
11. Efficient Streaming Language Models with Attention Sinks — Guangxuan Xiao, et al., 2023
https://scholar.google.com/scholar?q=Efficient+Streaming+Language+Models+with+Attention+Sinks
12. Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation — approx. survey authors, 2024/2025
https://scholar.google.com/scholar?q=Attention+Sink+in+Transformers:+A+Survey+on+Utilization,+Interpretation,+and+Mitigation
13. Long Context vs. RAG for LLMs: An Evaluation and Revisits — approx. recent RAG evaluation authors, 2024
https://scholar.google.com/scholar?q=Long+Context+vs.+RAG+for+LLMs:+An+Evaluation+and+Revisits
14. Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG — approx. recent RAG authors, 2024
https://scholar.google.com/scholar?q=Long-Context+LLMs+Meet+RAG:+Overcoming+Challenges+for+Long+Inputs+in+RAG
15. KV Cache Offloading for Context-Intensive Tasks — approx. recent systems authors, 2024/2025
https://scholar.google.com/scholar?q=KV+Cache+Offloading+for+Context-Intensive+Tasks
16. KVSwap: Disk-Aware KV Cache Offloading for Long-Context On-Device Inference — approx. recent systems authors, 2024/2025
https://scholar.google.com/scholar?q=KVSwap:+Disk-Aware+KV+Cache+Offloading+for+Long-Context+On-Device+Inference
17. AI Post Transformers: Episode: From Prefix Cache to Fusion RAG Cache: Accelerating LLM Inference in Retrieval-Augmented Generation — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-22-from-prefix-cache-to-fusion-rag-9c5d39.mp3
18. AI Post Transformers: CacheSlide: Position-Aware KV Cache Reuse for Agent LLMs — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-16-cacheslide-position-aware-kv-cache-reuse-cd59c7.mp3
19. AI Post Transformers: Prefill-as-a-Service for Cross-Datacenter KV Cache — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-19-prefill-as-a-service-for-cross-datacente-7560be.mp3
20. AI Post Transformers: KVSwap for Disk-Aware Long-Context On-Device Inference — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-16-kvswap-for-disk-aware-long-context-on-de-f3c15e.mp3
21. AI Post Transformers: FengHuang for Rack-Scale LLM Inference Memory — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-12-fenghuang-for-rack-scale-llm-inference-m-62708e.mp3
22. AI Post Transformers: Speculative Decoding in Real vLLM Serving — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-04-speculative-decoding-in-real-vllm-servin-6f4e2b.mp3
Interactive Visualization: CacheBlend for Fast RAG Serving