AI Post Transformers

KV Cache TTL for Multi-Turn Agent Scheduling



This episode explores a systems paper on serving multi-turn LLM agents, asking whether an agent's KV cache should be preserved during short tool-call pauses rather than evicted at the end of each turn. It explains why standard end-of-turn eviction works for human chat but breaks down for ReAct-style agents, where rapid tool use tightly couples successive turns and makes cache loss expensive. The discussion highlights two main costs of eviction: recomputing or reloading long prefixes, and the added per-turn queueing delay when resumed agent steps must re-enter service. This frames the issue as a scheduling problem rather than simple memory management, and shows how a seemingly low-level infrastructure choice can strongly affect agent latency, responsiveness, and the practical feel of AI systems.
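The core idea the episode discusses can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: instead of evicting a session's KV cache at end-of-turn, the scheduler pins it for a short time-to-live (TTL), so an agent returning from a quick tool call can resume without re-prefilling its whole prefix. All class and method names here are invented for the sketch.

```python
import time

class KVSlot:
    """A pinned KV-cache entry with an expiry deadline (illustrative)."""
    def __init__(self, session_id, ttl_s):
        self.session_id = session_id
        self.expires_at = time.monotonic() + ttl_s

class TTLCache:
    """Keep KV caches alive across tool-call pauses instead of evicting."""
    def __init__(self, ttl_s=5.0):
        self.ttl_s = ttl_s
        self.slots = {}  # session_id -> KVSlot

    def end_turn(self, session_id):
        # Instead of evicting at end-of-turn, start a TTL countdown.
        self.slots[session_id] = KVSlot(session_id, self.ttl_s)

    def resume(self, session_id):
        # Tool call finished: reuse the cache if the TTL hasn't lapsed.
        slot = self.slots.pop(session_id, None)
        if slot is not None and time.monotonic() < slot.expires_at:
            return "reuse"      # skip prefix recompute, rejoin the batch
        return "recompute"      # cache lost: pay prefill plus queueing delay

    def reap_expired(self):
        # Periodically free memory for sessions that never came back.
        now = time.monotonic()
        expired = [k for k, s in self.slots.items() if now >= s.expires_at]
        for k in expired:
            del self.slots[k]
        return expired
```

The scheduling tension the episode describes lives in the choice of `ttl_s`: too short and agents pay the eviction costs anyway; too long and pinned caches crowd out memory for other requests.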
Sources:
1. Continuum: Efficient and Robust Multi-Turn LLM Agent Scheduling with KV Cache Time-to-Live — Hanchen Li, Qiuyang Mang, Runyuan He, Qizheng Zhang, Huanzhi Mao, Xiaokun Chen, Hangrui Zhou, Alvin Cheung, Joseph Gonzalez, Ion Stoica, 2025
http://arxiv.org/abs/2511.02230
2. InferCept — authors/year not specified
https://scholar.google.com/scholar?q=InferCept
3. Autellix — authors/year not specified
https://scholar.google.com/scholar?q=Autellix
4. Pie — authors/year not specified
https://scholar.google.com/scholar?q=Pie
5. Ayo — authors/year not specified
https://scholar.google.com/scholar?q=Ayo
6. Alto — authors/year not specified
https://scholar.google.com/scholar?q=Alto
7. Parrot — authors/year not specified
https://scholar.google.com/scholar?q=Parrot
8. vLLM — authors/year not specified
https://scholar.google.com/scholar?q=vLLM
9. CPU offloading for KV cache reuse — authors/year not specified
https://scholar.google.com/scholar?q=CPU+offloading+for+KV+cache+reuse
10. CacheSlide: Unlocking Cross Position-Aware KV Cache Reuse for Accelerating LLM Serving — authors/year not specified
https://scholar.google.com/scholar?q=CacheSlide:+Unlocking+Cross+Position-Aware+KV+Cache+Reuse+for+Accelerating+LLM+Serving
11. KVCOMM: Online Cross-Context KV-Cache Communication for Efficient LLM-Based Multi-Agent Systems — authors/year not specified
https://scholar.google.com/scholar?q=KVCOMM:+Online+Cross-Context+KV-Cache+Communication+for+Efficient+LLM-Based+Multi-Agent+Systems
12. When KV Cache Reuse Fails in Multi-Agent Systems: Cross-Candidate Interaction is Crucial for LLM Judges — authors/year not specified
https://scholar.google.com/scholar?q=When+KV+Cache+Reuse+Fails+in+Multi-Agent+Systems:+Cross-Candidate+Interaction+is+Crucial+for+LLM+Judges
13. LayerKV: Optimizing Large Language Model Serving with Layer-Wise KV Cache Management — authors/year not specified
https://scholar.google.com/scholar?q=LayerKV:+Optimizing+Large+Language+Model+Serving+with+Layer-Wise+KV+Cache+Management
14. InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management — authors/year not specified
https://scholar.google.com/scholar?q=InfiniGen:+Efficient+Generative+Inference+of+Large+Language+Models+with+Dynamic+KV+Cache+Management
15. ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference — authors/year not specified
https://scholar.google.com/scholar?q=ShadowKV:+KV+Cache+in+Shadows+for+High-Throughput+Long-Context+LLM+Inference
16. KVPR: Efficient LLM Inference with I/O-Aware KV Cache Partial Recomputation — authors/year not specified
https://scholar.google.com/scholar?q=KVPR:+Efficient+LLM+Inference+with+I/O-Aware+KV+Cache+Partial+Recomputation
17. Fairness in Serving Large Language Models — authors/year not specified
https://scholar.google.com/scholar?q=Fairness+in+Serving+Large+Language+Models
18. Locality-Aware Fair Scheduling in LLM Serving — authors/year not specified
https://scholar.google.com/scholar?q=Locality-Aware+Fair+Scheduling+in+LLM+Serving
19. Ensuring Fair LLM Serving Amid Diverse Applications — authors/year not specified
https://scholar.google.com/scholar?q=Ensuring+Fair+LLM+Serving+Amid+Diverse+Applications
20. AI Post Transformers: Speculative Decoding in Real vLLM Serving — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-04-speculative-decoding-in-real-vllm-servin-6f4e2b.mp3
21. AI Post Transformers: LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-25-lookaheadkv-fast-and-accurate-kv-9cfc9f.mp3
22. AI Post Transformers: Lookahead Q-Cache for Consistent KV Eviction — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-25-lookahead-q-cache-for-consistent-kv-evic-d97b09.mp3
23. AI Post Transformers: Continuous Batching for LLM Inference: Throughput and Latency Gains — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/continuous-batching-for-llm-inference-throughput-and-latency-gains/
24. AI Post Transformers: Splitwise: Phase-Split LLM Inference — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-26-splitwise-phase-split-llm-inference-e8945b.mp3
Interactive Visualization: KV Cache TTL for Multi-Turn Agent Scheduling

AI Post Transformers, by mcgrof