This episode explores whether newer hybrid-attention language models make prefill-decode disaggregation practical across clusters, or even datacenters, by shrinking the KV cache enough to move it over ordinary Ethernet. It explains why the real production bottleneck is not request routing but transferring attention state between the prefill and decode phases, and contrasts dense transformers, whose KV cache grows heavily with context across many layers, with hybrid designs that use fewer full-attention layers and more bounded-state alternatives. The discussion highlights the paper's central claim: smaller KV footprints could enable remote, compute-dense prefill clusters paired with local decode clusters, especially for long, uncached prompts. It also questions how broadly that conclusion generalizes, since the evidence comes from a single internal 1-trillion-parameter model. Listeners will find it interesting for its concrete systems view of where disaggregated inference actually breaks, and for its argument that model architecture, not just serving software, may determine whether cross-cluster AI serving is viable.
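The arithmetic behind the episode's core contrast can be sketched in a few lines. The layer counts, KV-head sizes, and the 100 Gb/s link speed below are illustrative assumptions, not figures from the paper; the point is only that cutting the number of full-attention layers cuts both the cache footprint and the cross-cluster transfer time proportionally.

```python
# Back-of-envelope KV cache sizing and Ethernet transfer time.
# All model configs here are hypothetical, chosen for illustration.

def kv_cache_bytes(seq_len, full_attn_layers, kv_heads=8, head_dim=128, dtype_bytes=2):
    """Bytes of K+V state for one request, held only by full-attention layers.
    Factor of 2 covers the separate K and V tensors."""
    return 2 * full_attn_layers * kv_heads * head_dim * seq_len * dtype_bytes

def transfer_seconds(nbytes, gbits_per_sec):
    """Time to move nbytes over a link of the given speed (ideal, no overhead)."""
    return nbytes * 8 / (gbits_per_sec * 1e9)

SEQ = 128_000  # a long, uncached prompt

# Dense: every layer is full attention. Hybrid: 1-in-4 full attention,
# the rest use bounded-state alternatives that need no per-token cache.
dense  = kv_cache_bytes(SEQ, full_attn_layers=60)
hybrid = kv_cache_bytes(SEQ, full_attn_layers=15)

for name, nbytes in [("dense", dense), ("hybrid", hybrid)]:
    print(f"{name}: {nbytes / 2**30:.1f} GiB, "
          f"{transfer_seconds(nbytes, 100):.2f} s over 100 Gb/s Ethernet")
```

Under these assumptions the dense model's ~29 GiB cache takes a few seconds to ship per request, while the hybrid's quarter-sized cache moves in well under a second, which is the regime where remote prefill starts to look viable.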
Sources:
1. Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter — Ruoyu Qin, Weiran He, Yaoyu Wang, Zheming Li, Xinran Xu, Yongwei Wu, Weimin Zheng, Mingxing Zhang, 2026
http://arxiv.org/abs/2604.15039
2. Mooncake: Trading More Storage for Less Computation — KVCache-centric Architecture for Serving LLM Chatbot — Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, and collaborators, 2024
https://scholar.google.com/scholar?q=Mooncake:+Trading+More+Storage+for+Less+Computation+—+KVCache-centric+Architecture+for+Serving+LLM+Chatbot
3. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving — Yinmin Zhong, Shengyu Liu, and collaborators, 2024
https://scholar.google.com/scholar?q=DistServe:+Disaggregating+Prefill+and+Decoding+for+Goodput-optimized+Large+Language+Model+Serving
4. TetriInfer: Reconfigurable Inference Serving for Long-Context LLMs with KV Cache Dynamics — Authors vary by version; systems paper on long-context serving, 2024
https://scholar.google.com/scholar?q=TetriInfer:+Reconfigurable+Inference+Serving+for+Long-Context+LLMs+with+KV+Cache+Dynamics
5. Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter — Ruoyu Qin, Weiran He, Yaoyu Wang, Zheming Li, Xinran Xu, Yongwei Wu, Weimin Zheng, Mingxing Zhang, 2026
https://scholar.google.com/scholar?q=Prefill-as-a-Service:+KVCache+of+Next-Generation+Models+Could+Go+Cross-Datacenter
6. SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills — Amey Agrawal and collaborators, 2023
https://scholar.google.com/scholar?q=SARATHI:+Efficient+LLM+Inference+by+Piggybacking+Decodes+with+Chunked+Prefills
7. Splitwise: Efficient Generative LLM Inference Using Phase Splitting — Pratyush Patel, Esha Choukse, and collaborators, 2024
https://scholar.google.com/scholar?q=Splitwise:+Efficient+Generative+LLM+Inference+Using+Phase+Splitting
8. Mooncake — Referenced as [22] in the paper, Unknown from excerpt
https://scholar.google.com/scholar?q=Mooncake
9. vLLM — Woosuk Kwon and collaborators, 2023
https://scholar.google.com/scholar?q=vLLM
10. SGLang — Referenced as [7] in the paper, Likely 2024
https://scholar.google.com/scholar?q=SGLang
11. Dynamo — Referenced as [20] in the paper, Likely 2024 or 2025
https://scholar.google.com/scholar?q=Dynamo
12. Kimi Linear — Referenced as [26] in the paper, Likely 2025 or 2026
https://scholar.google.com/scholar?q=Kimi+Linear
13. Ring-2.5-1T — Referenced as [3] in the paper, Likely 2025 or 2026
https://scholar.google.com/scholar?q=Ring-2.5-1T
14. Lightning — Referenced as [23] in the paper, Unknown from excerpt
https://scholar.google.com/scholar?q=Lightning
15. Multi-Head Latent Attention (MLA) — Referenced as [12] in the paper, Unknown from excerpt
https://scholar.google.com/scholar?q=Multi-Head+Latent+Attention+(MLA)
16. LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference — approx. enterprise systems / LLM serving authors, 2024/2025
https://scholar.google.com/scholar?q=LMCache:+An+Efficient+KV+Cache+Layer+for+Enterprise-Scale+LLM+Inference
17. HotPrefix: Hotness-Aware KV Cache Scheduling for Efficient Prefix Sharing in LLM Inference Systems — approx. systems authors, 2024/2025
https://scholar.google.com/scholar?q=HotPrefix:+Hotness-Aware+KV+Cache+Scheduling+for+Efficient+Prefix+Sharing+in+LLM+Inference+Systems
18. KVShare: An LLM Service System with Efficient and Effective Multi-Tenant KV Cache Reuse — approx. systems authors, 2024/2025
https://scholar.google.com/scholar?q=KVShare:+An+LLM+Service+System+with+Efficient+and+Effective+Multi-Tenant+KV+Cache+Reuse
19. Selective KV-Cache Sharing to Mitigate Timing Side-Channels in LLM Inference — approx. security/systems authors, 2024/2025
https://scholar.google.com/scholar?q=Selective+KV-Cache+Sharing+to+Mitigate+Timing+Side-Channels+in+LLM+Inference
20. CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems — approx. security authors, 2024/2025
https://scholar.google.com/scholar?q=CacheSolidarity:+Preventing+Prefix+Caching+Side+Channels+in+Multi-tenant+LLM+Serving+Systems
21. WindServe: Efficient Phase-Disaggregated LLM Serving with Stream-Based Dynamic Scheduling — approx. systems authors, 2024/2025
https://scholar.google.com/scholar?q=WindServe:+Efficient+Phase-Disaggregated+LLM+Serving+with+Stream-Based+Dynamic+Scheduling
22. SLED: A Speculative LLM Decoding Framework for Efficient Edge Serving — approx. edge/serving authors, 2024/2025
https://scholar.google.com/scholar?q=SLED:+A+Speculative+LLM+Decoding+Framework+for+Efficient+Edge+Serving
23. LServe: Efficient Long-Sequence LLM Serving with Unified Sparse Attention — approx. systems/ML authors, 2024/2025
https://scholar.google.com/scholar?q=LServe:+Efficient+Long-Sequence+LLM+Serving+with+Unified+Sparse+Attention
24. On-the-Fly Adaptive Distillation of Transformer to Dual-State Linear Attention for Long-Context LLM Serving — approx. ML systems authors, 2024/2025
https://scholar.google.com/scholar?q=On-the-Fly+Adaptive+Distillation+of+Transformer+to+Dual-State+Linear+Attention+for+Long-Context+LLM+Serving
25. AI Post Transformers: Splitwise: Phase-Split LLM Inference — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-26-splitwise-phase-split-llm-inference-e8945b.mp3
26. AI Post Transformers: From Prefix Cache to Fusion RAG Cache: Accelerating LLM Inference in Retrieval-Augmented Generation — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-22-from-prefix-cache-to-fusion-rag-9c5d39.mp3
27. AI Post Transformers: FengHuang for Rack-Scale LLM Inference Memory — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-12-fenghuang-for-rack-scale-llm-inference-m-62708e.mp3
28. AI Post Transformers: CXL-SpecKV: Bridging the LLM Memory Wall with Speculative FPGA Disaggregation — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/cxl-speckv-bridging-the-llm-memory-wall-with-speculative-fpga-disaggregation/
29. AI Post Transformers: KVSwap for Disk-Aware Long-Context On-Device Inference — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-16-kvswap-for-disk-aware-long-context-on-de-f3c15e.mp3
30. AI Post Transformers: FAST26: Bidaw: Enhancing Key-Value Caching for Interactive LLM Serving via Bidirectional — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/fast26-bidaw-enhancing-key-value-caching-for-interactive-llm-serving-via-bidirec/
31. AI Post Transformers: Computation-Bandwidth-Memory Trade-offs for AI Infrastructure — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-09-computation-bandwidth-memory-trade-offs-a83f2b.mp3
32. AI Post Transformers: Mamba-3 for Efficient Sequence Modeling — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-16-mamba-3-for-efficient-sequence-modeling-97a22a.mp3
Interactive Visualization: Prefill-as-a-Service for Cross-Datacenter KV Cache