This episode explores TokenDance, a systems approach for serving many LLM-based agents more efficiently by sharing transformer KV caches collectively across synchronized conversation rounds. It explains why multi-agent workloads differ fundamentally from ordinary chat serving: agents persist across rounds, accumulate large KV caches, and often follow an “all-gather” pattern in which each agent receives a mostly shared broadcast appended to its own private history. Because that shared content lands at a different position in every agent's context, standard prefix-based cache reuse is ineffective. The discussion argues that the key innovation is shifting cache reuse from individual requests to the entire round of agents treated as a collective object: the shared content is materialized once and referenced by all agents, cutting memory use and letting more agents run on the same GPU. Listeners interested in agent systems, inference infrastructure, and practical bottlenecks beyond model architecture will find the episode compelling for its concrete diagnosis of memory management as the real constraint.
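To make the round-level sharing idea concrete, here is a minimal sketch of the contrast between per-request prefix caching and collective reuse. Everything in it is an illustrative assumption for exposition: the names `KVPool`, `AgentSession`, `serve_round`, and `BLOCK_TOKENS` are not APIs from the TokenDance paper, and the sketch deliberately ignores the positional dependence of real KV entries, which position-independent reuse techniques are meant to handle.

```python
from dataclasses import dataclass, field
from typing import Dict, List

BLOCK_TOKENS = 16  # illustrative page size, in the spirit of paged-attention allocators

@dataclass
class KVPool:
    """Content-addressed block pool: identical token blocks map to one physical entry."""
    blocks: Dict[int, List[int]] = field(default_factory=dict)   # handle -> token block
    refcount: Dict[int, int] = field(default_factory=dict)       # handle -> reference count

    def put(self, tokens: List[int]) -> List[int]:
        """Insert tokens block by block, deduplicating identical blocks."""
        handles = []
        for i in range(0, len(tokens), BLOCK_TOKENS):
            block = tokens[i:i + BLOCK_TOKENS]
            h = hash(tuple(block))
            if h not in self.blocks:      # first submitter of this block pays for it
                self.blocks[h] = block
                self.refcount[h] = 0
            self.refcount[h] += 1         # later submitters only take a reference
            handles.append(h)
        return handles

@dataclass
class AgentSession:
    private_history: List[int]                           # per-agent tokens, never shared
    kv_handles: List[int] = field(default_factory=list)

def serve_round(pool: KVPool, shared_prompt: List[int],
                agents: List[AgentSession]) -> None:
    """One synchronized 'all-gather' round: every agent sees the same broadcast
    appended to its own private history. Prefix caching fails here because the
    shared tokens sit *after* each agent's distinct history, so no common prefix
    exists; treating the round's broadcast as one collective object lets every
    agent reference a single copy of its KV blocks."""
    shared_handles = pool.put(shared_prompt)             # materialized once per round
    for agent in agents:
        agent.kv_handles = pool.put(agent.private_history) + shared_handles

# Toy usage: three agents with disjoint 48-token histories, one 32-token broadcast.
pool = KVPool()
agents = [AgentSession(private_history=list(range(100 * k, 100 * k + 48)))
          for k in range(3)]
serve_round(pool, shared_prompt=list(range(1000, 1032)), agents=agents)
print(len(pool.blocks))  # 9 private blocks + 2 shared blocks = 11, not 15 per-request copies
```

The design point this illustrates is the one the episode emphasizes: the unit of reuse is the round's broadcast rather than any single request's prefix, so the savings grow with the number of agents in the round.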
Sources:
1. TokenDance: Scaling Multi-Agent LLM Serving via Collective KV Cache Sharing — Zhuohang Bian, Feiyang Wu, Chengrui Zhang, Hangcheng Dong, Yun Liang, Youwei Zhuo, 2026
http://arxiv.org/abs/2604.03143
2. Efficient Memory Management for Large Language Model Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Hao Zhang, et al., 2023
https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention
3. SGLang: Efficient Execution of Structured Language Model Programs — Lianmin Zheng, Weizhe Chen, Ying Sheng, Tianqi Chen, Ion Stoica, et al., 2024
https://scholar.google.com/scholar?q=SGLang:+Efficient+Execution+of+Structured+Language+Model+Programs
4. Parrot: Efficient Serving of LLM-based Applications with Semantic Variable — Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, et al., 2024
https://scholar.google.com/scholar?q=Parrot:+Efficient+Serving+of+LLM-based+Applications+with+Semantic+Variable
5. vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, et al., 2023
https://scholar.google.com/scholar?q=vLLM:+Easy,+Fast,+and+Cheap+LLM+Serving+with+PagedAttention
6. Autellix — authors as cited in the paper, 2024
https://scholar.google.com/scholar?q=Autellix
7. Tokencake — authors as cited in the paper, 2024
https://scholar.google.com/scholar?q=Tokencake
8. Generative Agents: Interactive Simulacra of Human Behavior — Joon Sung Park, Joseph C. O'Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, Michael S. Bernstein, 2023
https://scholar.google.com/scholar?q=Generative+Agents:+Interactive+Simulacra+of+Human+Behavior
9. Position-independent KV-cache reuse papers (cited as [10, 34-36] in the TokenDance paper) — authors as cited in the paper, 2024-2026
https://scholar.google.com/scholar?q=Position-independent+KV-cache+reuse+papers+cited+as+[10,+34-36]
10. OpenClaw — authors as cited in the paper, 2024
https://scholar.google.com/scholar?q=OpenClaw
11. MoltBook — authors as cited in the paper, 2024
https://scholar.google.com/scholar?q=MoltBook
12. DynTaskMAS: A Dynamic Task Graph-Driven Framework for Asynchronous and Parallel LLM-Based Multi-Agent Systems — authors as cited in the paper, 2024/2025
https://scholar.google.com/scholar?q=DynTaskMAS:+A+Dynamic+Task+Graph-Driven+Framework+for+Asynchronous+and+Parallel+LLM-Based+Multi-Agent+Systems
13. Kairos: Low-Latency Multi-Agent Serving with Shared LLMs and Excessive Loads in the Public Cloud — authors as cited in the paper, 2024/2025
https://scholar.google.com/scholar?q=Kairos:+Low-Latency+Multi-Agent+Serving+with+Shared+LLMs+and+Excessive+Loads+in+the+Public+Cloud
14. CacheSlide: Unlocking Cross Position-Aware KV Cache Reuse for Accelerating LLM Serving — authors as cited in the paper, 2024/2025
https://scholar.google.com/scholar?q=CacheSlide:+Unlocking+Cross+Position-Aware+KV+Cache+Reuse+for+Accelerating+LLM+Serving
15. Where Matters More Than What: Decoding-Aligned KV Cache Compression via Position-Aware Pseudo Queries — authors as cited in the paper, 2024/2025
https://scholar.google.com/scholar?q=Where+Matters+More+Than+What:+Decoding-Aligned+KV+Cache+Compression+via+Position-Aware+Pseudo+Queries
16. KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse — authors as cited in the paper, 2024/2025
https://scholar.google.com/scholar?q=KVLink:+Accelerating+Large+Language+Models+via+Efficient+KV+Cache+Reuse
17. HyperRAG: Enhancing Quality-Efficiency Tradeoffs in Retrieval-Augmented Generation with Reranker KV-Cache Reuse — authors as cited in the paper, 2024/2025
https://scholar.google.com/scholar?q=HyperRAG:+Enhancing+Quality-Efficiency+Tradeoffs+in+Retrieval-Augmented+Generation+with+Reranker+KV-Cache+Reuse
18. ProphetKV: User-Query-Driven Selective Recomputation for Efficient KV Cache Reuse in Retrieval-Augmented Generation — authors as cited in the paper, 2024/2025
https://scholar.google.com/scholar?q=ProphetKV:+User-Query-Driven+Selective+Recomputation+for+Efficient+KV+Cache+Reuse+in+Retrieval-Augmented+Generation
19. Eigen Attention: Attention in Low-Rank Space for KV Cache Compression — authors as cited in the paper, 2024/2025
https://scholar.google.com/scholar?q=Eigen+Attention:+Attention+in+Low-Rank+Space+for+KV+Cache+Compression
20. PALU: KV-Cache Compression with Low-Rank Projection — authors as cited in the paper, 2024/2025
https://scholar.google.com/scholar?q=PALU:+KV-Cache+Compression+with+Low-Rank+Projection
21. LORC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy — authors as cited in the paper, 2024/2025
https://scholar.google.com/scholar?q=LORC:+Low-Rank+Compression+for+LLMs+KV+Cache+with+a+Progressive+Compression+Strategy
22. AI Post Transformers: CacheSlide: Position-Aware KV Cache Reuse for Agent LLMs — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-16-cacheslide-position-aware-kv-cache-reuse-cd59c7.mp3
23. AI Post Transformers: ContiguousKV for Faster LLM Prefill KV Reuse — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-20-contiguouskv-for-faster-llm-prefill-kv-r-59f545.mp3
24. AI Post Transformers: KV Cache TTL for Multi-Turn Agent Scheduling — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-09-kv-cache-ttl-for-multi-turn-agent-schedu-996bf1.mp3
25. AI Post Transformers: Continuous Batching for LLM Inference: Throughput and Latency Gains — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/continuous-batching-for-llm-inference-throughput-and-latency-gains/
26. AI Post Transformers: Speculative Decoding in Real vLLM Serving — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-04-speculative-decoding-in-real-vllm-servin-6f4e2b.mp3
27. AI Post Transformers: Splitwise: Phase-Split LLM Inference — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-26-splitwise-phase-split-llm-inference-e8945b.mp3
28. AI Post Transformers: FengHuang for Rack-Scale LLM Inference Memory — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-12-fenghuang-for-rack-scale-llm-inference-m-62708e.mp3
29. AI Post Transformers: From Prefix Cache to Fusion RAG Cache: Accelerating LLM Inference in Retrieval-Augmented Generation — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-22-from-prefix-cache-to-fusion-rag-9c5d39.mp3
Interactive Visualization: TokenDance for Multi-Agent KV Cache Sharing