AI Post Transformers

KV Cache TTL for Multi-Turn Agent Scheduling



This episode explores a systems paper on serving multi-turn LLM agents, asking whether an agent's KV cache should be preserved during short tool-call pauses rather than evicted at the end of each turn. It explains why standard end-of-turn eviction works for human chat but breaks down for ReAct-style agents, where rapid tool use tightly couples successive turns and makes cache loss expensive. The discussion highlights two main costs of eviction: recomputing or reloading long prefixes, and the added per-turn queueing delay when resumed agent steps must re-enter service. This frames the issue as a scheduling problem rather than simple memory management, and shows how a seemingly low-level infrastructure choice can strongly affect agent latency, responsiveness, and the practical feel of AI systems.
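The core idea the episode discusses can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: instead of evicting a session's KV cache at end-of-turn, the scheduler pins it for a short time-to-live (TTL), so an agent returning from a quick tool call can resume without re-prefilling its whole prefix. All class and method names here are invented for the sketch.

```python
import time

class KVSlot:
    """A pinned KV-cache entry with an expiry deadline (illustrative)."""
    def __init__(self, session_id, ttl_s):
        self.session_id = session_id
        self.expires_at = time.monotonic() + ttl_s

class TTLCache:
    """Keep KV caches alive across tool-call pauses instead of evicting."""
    def __init__(self, ttl_s=5.0):
        self.ttl_s = ttl_s
        self.slots = {}  # session_id -> KVSlot

    def end_turn(self, session_id):
        # Instead of evicting at end-of-turn, start a TTL countdown.
        self.slots[session_id] = KVSlot(session_id, self.ttl_s)

    def resume(self, session_id):
        # Tool call finished: reuse the cache if the TTL hasn't lapsed.
        slot = self.slots.pop(session_id, None)
        if slot is not None and time.monotonic() < slot.expires_at:
            return "reuse"      # skip prefix recompute, rejoin the batch
        return "recompute"      # cache lost: pay prefill plus queueing delay

    def reap_expired(self):
        # Periodically free memory for sessions that never came back.
        now = time.monotonic()
        expired = [k for k, s in self.slots.items() if now >= s.expires_at]
        for k in expired:
            del self.slots[k]
        return expired
```

The scheduling tension the episode describes lives in the choice of `ttl_s`: too short and agents pay the eviction costs anyway; too long and pinned caches crowd out memory for other requests.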
Sources:
1. Continuum: Efficient and Robust Multi-Turn LLM Agent Scheduling with KV Cache Time-to-Live — Hanchen Li, Qiuyang Mang, Runyuan He, Qizheng Zhang, Huanzhi Mao, Xiaokun Chen, Hangrui Zhou, Alvin Cheung, Joseph Gonzalez, Ion Stoica, 2025
http://arxiv.org/abs/2511.02230
2. InferCept — authors/year not specified
https://scholar.google.com/scholar?q=InferCept
3. Autellix — authors/year not specified
https://scholar.google.com/scholar?q=Autellix
4. Pie — authors/year not specified
https://scholar.google.com/scholar?q=Pie
5. Ayo — authors/year not specified
https://scholar.google.com/scholar?q=Ayo
6. Alto — authors/year not specified
https://scholar.google.com/scholar?q=Alto
7. Parrot — authors/year not specified
https://scholar.google.com/scholar?q=Parrot
8. vLLM — authors/year not specified
https://scholar.google.com/scholar?q=vLLM
9. CPU offloading for KV cache reuse — authors/year not specified
https://scholar.google.com/scholar?q=CPU+offloading+for+KV+cache+reuse
10. CacheSlide: Unlocking Cross Position-Aware KV Cache Reuse for Accelerating LLM Serving — authors/year not specified
https://scholar.google.com/scholar?q=CacheSlide:+Unlocking+Cross+Position-Aware+KV+Cache+Reuse+for+Accelerating+LLM+Serving
11. KVCOMM: Online Cross-Context KV-Cache Communication for Efficient LLM-Based Multi-Agent Systems — authors/year not specified
https://scholar.google.com/scholar?q=KVCOMM:+Online+Cross-Context+KV-Cache+Communication+for+Efficient+LLM-Based+Multi-Agent+Systems
12. When KV Cache Reuse Fails in Multi-Agent Systems: Cross-Candidate Interaction is Crucial for LLM Judges — authors/year not specified
https://scholar.google.com/scholar?q=When+KV+Cache+Reuse+Fails+in+Multi-Agent+Systems:+Cross-Candidate+Interaction+is+Crucial+for+LLM+Judges
13. LayerKV: Optimizing Large Language Model Serving with Layer-Wise KV Cache Management — authors/year not specified
https://scholar.google.com/scholar?q=LayerKV:+Optimizing+Large+Language+Model+Serving+with+Layer-Wise+KV+Cache+Management
14. InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management — authors/year not specified
https://scholar.google.com/scholar?q=InfiniGen:+Efficient+Generative+Inference+of+Large+Language+Models+with+Dynamic+KV+Cache+Management
15. ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference — authors/year not specified
https://scholar.google.com/scholar?q=ShadowKV:+KV+Cache+in+Shadows+for+High-Throughput+Long-Context+LLM+Inference
16. KVPR: Efficient LLM Inference with I/O-Aware KV Cache Partial Recomputation — authors/year not specified
https://scholar.google.com/scholar?q=KVPR:+Efficient+LLM+Inference+with+I/O-Aware+KV+Cache+Partial+Recomputation
17. Fairness in Serving Large Language Models — authors/year not specified
https://scholar.google.com/scholar?q=Fairness+in+Serving+Large+Language+Models
18. Locality-Aware Fair Scheduling in LLM Serving — authors/year not specified
https://scholar.google.com/scholar?q=Locality-Aware+Fair+Scheduling+in+LLM+Serving
19. Ensuring Fair LLM Serving Amid Diverse Applications — authors/year not specified
https://scholar.google.com/scholar?q=Ensuring+Fair+LLM+Serving+Amid+Diverse+Applications
20. AI Post Transformers: Speculative Decoding in Real vLLM Serving — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-04-speculative-decoding-in-real-vllm-servin-6f4e2b.mp3
21. AI Post Transformers: LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-25-lookaheadkv-fast-and-accurate-kv-9cfc9f.mp3
22. AI Post Transformers: Lookahead Q-Cache for Consistent KV Eviction — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-25-lookahead-q-cache-for-consistent-kv-evic-d97b09.mp3
23. AI Post Transformers: Continuous Batching for LLM Inference: Throughput and Latency Gains — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/continuous-batching-for-llm-inference-throughput-and-latency-gains/
24. AI Post Transformers: Splitwise: Phase-Split LLM Inference — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-26-splitwise-phase-split-llm-inference-e8945b.mp3
Interactive Visualization: KV Cache TTL for Multi-Turn Agent Scheduling

AI Post Transformers, by mcgrof