This episode explores TokenDance, a systems approach for serving many LLM-based agents more efficiently by sharing transformer KV caches collectively across synchronized conversation rounds. It explains why multi-agent workloads differ fundamentally from ordinary chat serving: agents persist across rounds, accumulate large KV caches, and often follow an “all-gather” pattern in which each agent receives a mostly shared broadcast appended to its own private history. Because that shared content lands at a different position in every agent's context, standard prefix-based cache reuse is ineffective. The discussion argues that the key innovation is shifting cache reuse from individual requests to the entire round of agents treated as a collective object: the shared content is materialized once and referenced by all agents, cutting memory use and letting more agents run on the same GPU. Listeners interested in agent systems, inference infrastructure, and practical bottlenecks beyond model architecture will find the episode compelling for its concrete diagnosis of memory management as the real constraint.
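To make the round-level sharing idea concrete, here is a minimal sketch of the contrast between per-request prefix caching and collective reuse. Everything in it is an illustrative assumption for exposition: the names `KVPool`, `AgentSession`, `serve_round`, and `BLOCK_TOKENS` are not APIs from the TokenDance paper, and the sketch deliberately ignores the positional dependence of real KV entries, which position-independent reuse techniques are meant to handle.

```python
from dataclasses import dataclass, field
from typing import Dict, List

BLOCK_TOKENS = 16  # illustrative page size, in the spirit of paged-attention allocators

@dataclass
class KVPool:
    """Content-addressed block pool: identical token blocks map to one physical entry."""
    blocks: Dict[int, List[int]] = field(default_factory=dict)   # handle -> token block
    refcount: Dict[int, int] = field(default_factory=dict)       # handle -> reference count

    def put(self, tokens: List[int]) -> List[int]:
        """Insert tokens block by block, deduplicating identical blocks."""
        handles = []
        for i in range(0, len(tokens), BLOCK_TOKENS):
            block = tokens[i:i + BLOCK_TOKENS]
            h = hash(tuple(block))
            if h not in self.blocks:      # first submitter of this block pays for it
                self.blocks[h] = block
                self.refcount[h] = 0
            self.refcount[h] += 1         # later submitters only take a reference
            handles.append(h)
        return handles

@dataclass
class AgentSession:
    private_history: List[int]                           # per-agent tokens, never shared
    kv_handles: List[int] = field(default_factory=list)

def serve_round(pool: KVPool, shared_prompt: List[int],
                agents: List[AgentSession]) -> None:
    """One synchronized 'all-gather' round: every agent sees the same broadcast
    appended to its own private history. Prefix caching fails here because the
    shared tokens sit *after* each agent's distinct history, so no common prefix
    exists; treating the round's broadcast as one collective object lets every
    agent reference a single copy of its KV blocks."""
    shared_handles = pool.put(shared_prompt)             # materialized once per round
    for agent in agents:
        agent.kv_handles = pool.put(agent.private_history) + shared_handles

# Toy usage: three agents with disjoint 48-token histories, one 32-token broadcast.
pool = KVPool()
agents = [AgentSession(private_history=list(range(100 * k, 100 * k + 48)))
          for k in range(3)]
serve_round(pool, shared_prompt=list(range(1000, 1032)), agents=agents)
print(len(pool.blocks))  # 9 private blocks + 2 shared blocks = 11, not 15 per-request copies
```

The design point this illustrates is the one the episode emphasizes: the unit of reuse is the round's broadcast rather than any single request's prefix, so the savings grow with the number of agents in the round.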
Sources:
1. TokenDance: Scaling Multi-Agent LLM Serving via Collective KV Cache Sharing — Zhuohang Bian, Feiyang Wu, Chengrui Zhang, Hangcheng Dong, Yun Liang, Youwei Zhuo, 2026
http://arxiv.org/abs/2604.03143
2. Efficient Memory Management for Large Language Model Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Hao Zhang, et al., 2023
https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention
3. SGLang: Efficient Execution of Structured Language Model Programs — Lianmin Zheng, Weizhe Chen, Ying Sheng, Tianqi Chen, Ion Stoica, et al., 2024
https://scholar.google.com/scholar?q=SGLang:+Efficient+Execution+of+Structured+Language+Model+Programs
4. Parrot: Efficient Serving of LLM-based Applications with Semantic Variable — Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, et al., 2024
https://scholar.google.com/scholar?q=Parrot:+Efficient+Serving+of+LLM-based+Applications+with+Semantic+Variable
5. vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, et al., 2023
https://scholar.google.com/scholar?q=vLLM:+Easy,+Fast,+and+Cheap+LLM+Serving+with+PagedAttention
6. Autellix — authors as cited in the paper, 2024
https://scholar.google.com/scholar?q=Autellix
7. Tokencake — authors as cited in the paper, 2024
https://scholar.google.com/scholar?q=Tokencake
8. Generative Agents: Interactive Simulacra of Human Behavior — Joon Sung Park, Joseph C. O'Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, Michael S. Bernstein, 2023
https://scholar.google.com/scholar?q=Generative+Agents:+Interactive+Simulacra+of+Human+Behavior
9. Position-independent KV-cache reuse papers (cited as [10, 34-36] in the TokenDance paper) — authors as cited in the paper, 2024-2026
https://scholar.google.com/scholar?q=Position-independent+KV-cache+reuse+papers+cited+as+[10,+34-36]
10. OpenClaw — authors as cited in the paper, 2024
https://scholar.google.com/scholar?q=OpenClaw
11. MoltBook — authors as cited in the paper, 2024
https://scholar.google.com/scholar?q=MoltBook
12. DynTaskMAS: A Dynamic Task Graph-Driven Framework for Asynchronous and Parallel LLM-Based Multi-Agent Systems — authors as cited in the paper, 2024/2025
https://scholar.google.com/scholar?q=DynTaskMAS:+A+Dynamic+Task+Graph-Driven+Framework+for+Asynchronous+and+Parallel+LLM-Based+Multi-Agent+Systems
13. Kairos: Low-Latency Multi-Agent Serving with Shared LLMs and Excessive Loads in the Public Cloud — authors as cited in the paper, 2024/2025
https://scholar.google.com/scholar?q=Kairos:+Low-Latency+Multi-Agent+Serving+with+Shared+LLMs+and+Excessive+Loads+in+the+Public+Cloud
14. CacheSlide: Unlocking Cross Position-Aware KV Cache Reuse for Accelerating LLM Serving — authors as cited in the paper, 2024/2025
https://scholar.google.com/scholar?q=CacheSlide:+Unlocking+Cross+Position-Aware+KV+Cache+Reuse+for+Accelerating+LLM+Serving
15. Where Matters More Than What: Decoding-Aligned KV Cache Compression via Position-Aware Pseudo Queries — authors as cited in the paper, 2024/2025
https://scholar.google.com/scholar?q=Where+Matters+More+Than+What:+Decoding-Aligned+KV+Cache+Compression+via+Position-Aware+Pseudo+Queries
16. KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse — authors as cited in the paper, 2024/2025
https://scholar.google.com/scholar?q=KVLink:+Accelerating+Large+Language+Models+via+Efficient+KV+Cache+Reuse
17. HyperRAG: Enhancing Quality-Efficiency Tradeoffs in Retrieval-Augmented Generation with Reranker KV-Cache Reuse — authors as cited in the paper, 2024/2025
https://scholar.google.com/scholar?q=HyperRAG:+Enhancing+Quality-Efficiency+Tradeoffs+in+Retrieval-Augmented+Generation+with+Reranker+KV-Cache+Reuse
18. ProphetKV: User-Query-Driven Selective Recomputation for Efficient KV Cache Reuse in Retrieval-Augmented Generation — authors as cited in the paper, 2024/2025
https://scholar.google.com/scholar?q=ProphetKV:+User-Query-Driven+Selective+Recomputation+for+Efficient+KV+Cache+Reuse+in+Retrieval-Augmented+Generation
19. Eigen Attention: Attention in Low-Rank Space for KV Cache Compression — authors as cited in the paper, 2024/2025
https://scholar.google.com/scholar?q=Eigen+Attention:+Attention+in+Low-Rank+Space+for+KV+Cache+Compression
20. PALU: KV-Cache Compression with Low-Rank Projection — authors as cited in the paper, 2024/2025
https://scholar.google.com/scholar?q=PALU:+KV-Cache+Compression+with+Low-Rank+Projection
21. LORC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy — authors as cited in the paper, 2024/2025
https://scholar.google.com/scholar?q=LORC:+Low-Rank+Compression+for+LLMs+KV+Cache+with+a+Progressive+Compression+Strategy
22. AI Post Transformers: CacheSlide: Position-Aware KV Cache Reuse for Agent LLMs — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-16-cacheslide-position-aware-kv-cache-reuse-cd59c7.mp3
23. AI Post Transformers: ContiguousKV for Faster LLM Prefill KV Reuse — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-20-contiguouskv-for-faster-llm-prefill-kv-r-59f545.mp3
24. AI Post Transformers: KV Cache TTL for Multi-Turn Agent Scheduling — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-09-kv-cache-ttl-for-multi-turn-agent-schedu-996bf1.mp3
25. AI Post Transformers: Continuous Batching for LLM Inference: Throughput and Latency Gains — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/continuous-batching-for-llm-inference-throughput-and-latency-gains/
26. AI Post Transformers: Speculative Decoding in Real vLLM Serving — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-04-speculative-decoding-in-real-vllm-servin-6f4e2b.mp3
27. AI Post Transformers: Splitwise: Phase-Split LLM Inference — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-26-splitwise-phase-split-llm-inference-e8945b.mp3
28. AI Post Transformers: FengHuang for Rack-Scale LLM Inference Memory — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-12-fenghuang-for-rack-scale-llm-inference-m-62708e.mp3
29. AI Post Transformers: From Prefix Cache to Fusion RAG Cache: Accelerating LLM Inference in Retrieval-Augmented Generation — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-22-from-prefix-cache-to-fusion-rag-9c5d39.mp3
Interactive Visualization: TokenDance for Multi-Agent KV Cache Sharing