This episode explores whether newer hybrid-attention language models make prefill-decode disaggregation practical across clusters, or even datacenters, by shrinking the KV cache enough to move it over ordinary Ethernet. It explains why the real production bottleneck is not request routing but transferring attention state between the prefill and decode phases, and contrasts dense transformers, whose KV cache grows heavily with context across many layers, with hybrid designs that use fewer full-attention layers and more bounded-state alternatives. The discussion highlights the paper's central claim: smaller KV footprints could enable remote, compute-dense prefill clusters paired with local decode clusters, especially for long, uncached prompts. It also questions how broadly that conclusion generalizes, since the evidence comes from a single internal 1-trillion-parameter model. Listeners will find it interesting for its concrete systems view of where disaggregated inference actually breaks, and for its argument that model architecture, not just serving software, may determine whether cross-cluster AI serving is viable.
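The arithmetic behind the episode's core contrast can be sketched in a few lines. The layer counts, KV-head sizes, and the 100 Gb/s link speed below are illustrative assumptions, not figures from the paper; the point is only that cutting the number of full-attention layers cuts both the cache footprint and the cross-cluster transfer time proportionally.

```python
# Back-of-envelope KV cache sizing and Ethernet transfer time.
# All model configs here are hypothetical, chosen for illustration.

def kv_cache_bytes(seq_len, full_attn_layers, kv_heads=8, head_dim=128, dtype_bytes=2):
    """Bytes of K+V state for one request, held only by full-attention layers.
    Factor of 2 covers the separate K and V tensors."""
    return 2 * full_attn_layers * kv_heads * head_dim * seq_len * dtype_bytes

def transfer_seconds(nbytes, gbits_per_sec):
    """Time to move nbytes over a link of the given speed (ideal, no overhead)."""
    return nbytes * 8 / (gbits_per_sec * 1e9)

SEQ = 128_000  # a long, uncached prompt

# Dense: every layer is full attention. Hybrid: 1-in-4 full attention,
# the rest use bounded-state alternatives that need no per-token cache.
dense  = kv_cache_bytes(SEQ, full_attn_layers=60)
hybrid = kv_cache_bytes(SEQ, full_attn_layers=15)

for name, nbytes in [("dense", dense), ("hybrid", hybrid)]:
    print(f"{name}: {nbytes / 2**30:.1f} GiB, "
          f"{transfer_seconds(nbytes, 100):.2f} s over 100 Gb/s Ethernet")
```

Under these assumptions the dense model's ~29 GiB cache takes a few seconds to ship per request, while the hybrid's quarter-sized cache moves in well under a second, which is the regime where remote prefill starts to look viable.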
Sources:
1. Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter — Ruoyu Qin, Weiran He, Yaoyu Wang, Zheming Li, Xinran Xu, Yongwei Wu, Weimin Zheng, Mingxing Zhang, 2026
http://arxiv.org/abs/2604.15039
2. Mooncake: Trading More Storage for Less Computation — KVCache-centric Architecture for Serving LLM Chatbot — Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, and collaborators, 2024
https://scholar.google.com/scholar?q=Mooncake:+Trading+More+Storage+for+Less+Computation+—+KVCache-centric+Architecture+for+Serving+LLM+Chatbot
3. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving — Yinmin Zhong, Shengyu Liu, and collaborators, 2024
https://scholar.google.com/scholar?q=DistServe:+Disaggregating+Prefill+and+Decoding+for+Goodput-optimized+Large+Language+Model+Serving
4. TetriInfer: Reconfigurable Inference Serving for Long-Context LLMs with KV Cache Dynamics — Authors vary by version; systems paper on long-context serving, 2024
https://scholar.google.com/scholar?q=TetriInfer:+Reconfigurable+Inference+Serving+for+Long-Context+LLMs+with+KV+Cache+Dynamics
5. Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter — Ruoyu Qin, Weiran He, Yaoyu Wang, Zheming Li, Xinran Xu, Yongwei Wu, Weimin Zheng, Mingxing Zhang, 2026
https://scholar.google.com/scholar?q=Prefill-as-a-Service:+KVCache+of+Next-Generation+Models+Could+Go+Cross-Datacenter
6. SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills — Amey Agrawal and collaborators, 2023
https://scholar.google.com/scholar?q=SARATHI:+Efficient+LLM+Inference+by+Piggybacking+Decodes+with+Chunked+Prefills
7. Splitwise: Efficient Generative LLM Inference Using Phase Splitting — Pratyush Patel, Esha Choukse, and collaborators, 2024
https://scholar.google.com/scholar?q=Splitwise:+Efficient+Generative+LLM+Inference+Using+Phase+Splitting
8. Mooncake — Referenced as [22] in the paper, Unknown from excerpt
https://scholar.google.com/scholar?q=Mooncake
9. vLLM — Woosuk Kwon and collaborators, 2023
https://scholar.google.com/scholar?q=vLLM
10. SGLang — Referenced as [7] in the paper, Likely 2024
https://scholar.google.com/scholar?q=SGLang
11. Dynamo — Referenced as [20] in the paper, Likely 2024 or 2025
https://scholar.google.com/scholar?q=Dynamo
12. Kimi Linear — Referenced as [26] in the paper, Likely 2025 or 2026
https://scholar.google.com/scholar?q=Kimi+Linear
13. Ring-2.5-1T — Referenced as [3] in the paper, Likely 2025 or 2026
https://scholar.google.com/scholar?q=Ring-2.5-1T
14. Lightning — Referenced as [23] in the paper, Unknown from excerpt
https://scholar.google.com/scholar?q=Lightning
15. Multi-Head Latent Attention (MLA) — Referenced as [12] in the paper, Unknown from excerpt
https://scholar.google.com/scholar?q=Multi-Head+Latent+Attention+(MLA)
16. LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference — approx. enterprise systems / LLM serving authors, 2024/2025
https://scholar.google.com/scholar?q=LMCache:+An+Efficient+KV+Cache+Layer+for+Enterprise-Scale+LLM+Inference
17. HotPrefix: Hotness-Aware KV Cache Scheduling for Efficient Prefix Sharing in LLM Inference Systems — approx. systems authors, 2024/2025
https://scholar.google.com/scholar?q=HotPrefix:+Hotness-Aware+KV+Cache+Scheduling+for+Efficient+Prefix+Sharing+in+LLM+Inference+Systems
18. KVShare: An LLM Service System with Efficient and Effective Multi-Tenant KV Cache Reuse — approx. systems authors, 2024/2025
https://scholar.google.com/scholar?q=KVShare:+An+LLM+Service+System+with+Efficient+and+Effective+Multi-Tenant+KV+Cache+Reuse
19. Selective KV-Cache Sharing to Mitigate Timing Side-Channels in LLM Inference — approx. security/systems authors, 2024/2025
https://scholar.google.com/scholar?q=Selective+KV-Cache+Sharing+to+Mitigate+Timing+Side-Channels+in+LLM+Inference
20. CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems — approx. security authors, 2024/2025
https://scholar.google.com/scholar?q=CacheSolidarity:+Preventing+Prefix+Caching+Side+Channels+in+Multi-tenant+LLM+Serving+Systems
21. WindServe: Efficient Phase-Disaggregated LLM Serving with Stream-Based Dynamic Scheduling — approx. systems authors, 2024/2025
https://scholar.google.com/scholar?q=WindServe:+Efficient+Phase-Disaggregated+LLM+Serving+with+Stream-Based+Dynamic+Scheduling
22. SLED: A Speculative LLM Decoding Framework for Efficient Edge Serving — approx. edge/serving authors, 2024/2025
https://scholar.google.com/scholar?q=SLED:+A+Speculative+LLM+Decoding+Framework+for+Efficient+Edge+Serving
23. LServe: Efficient Long-Sequence LLM Serving with Unified Sparse Attention — approx. systems/ML authors, 2024/2025
https://scholar.google.com/scholar?q=LServe:+Efficient+Long-Sequence+LLM+Serving+with+Unified+Sparse+Attention
24. On-the-Fly Adaptive Distillation of Transformer to Dual-State Linear Attention for Long-Context LLM Serving — approx. ML systems authors, 2024/2025
https://scholar.google.com/scholar?q=On-the-Fly+Adaptive+Distillation+of+Transformer+to+Dual-State+Linear+Attention+for+Long-Context+LLM+Serving
25. AI Post Transformers: Splitwise: Phase-Split LLM Inference — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-26-splitwise-phase-split-llm-inference-e8945b.mp3
26. AI Post Transformers: From Prefix Cache to Fusion RAG Cache: Accelerating LLM Inference in Retrieval-Augmented Generation — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-22-from-prefix-cache-to-fusion-rag-9c5d39.mp3
27. AI Post Transformers: FengHuang for Rack-Scale LLM Inference Memory — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-12-fenghuang-for-rack-scale-llm-inference-m-62708e.mp3
28. AI Post Transformers: CXL-SpecKV: Bridging the LLM Memory Wall with Speculative FPGA Disaggregation — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/cxl-speckv-bridging-the-llm-memory-wall-with-speculative-fpga-disaggregation/
29. AI Post Transformers: KVSwap for Disk-Aware Long-Context On-Device Inference — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-16-kvswap-for-disk-aware-long-context-on-de-f3c15e.mp3
30. AI Post Transformers: FAST26: Bidaw: Enhancing Key-Value Caching for Interactive LLM Serving via Bidirectional — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/fast26-bidaw-enhancing-key-value-caching-for-interactive-llm-serving-via-bidirec/
31. AI Post Transformers: Computation-Bandwidth-Memory Trade-offs for AI Infrastructure — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-09-computation-bandwidth-memory-trade-offs-a83f2b.mp3
32. AI Post Transformers: Mamba-3 for Efficient Sequence Modeling — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-16-mamba-3-for-efficient-sequence-modeling-97a22a.mp3
Interactive Visualization: Prefill-as-a-Service for Cross-Datacenter KV Cache