Hal Turing and Dr. Ada Shannon dig into "DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference," a February 2026 paper from a thirteen-author team spanning Peking University, Tsinghua University, and DeepSeek-AI. The episode opens with a striking observation from production: on disaggregated inference clusters running agentic workloads, prefill machines drive their storage NICs to 100% utilization while the equivalent hardware on decode machines sits nearly idle. The hosts use this asymmetry as a lens into a counterintuitive reality — H100 GPUs throttled to 40% compute utilization not by arithmetic limits, but by a storage NIC.
Ada explains the structural reason agentic workloads are uniquely hostile to existing infrastructure: the short-append pattern. Unlike standard multi-turn chat, agentic sessions accumulate dozens to hundreds of turns where each round appends only a small number of tokens — a tool result, a stack trace, a code output — onto a context that may already span tens of thousands of tokens. Because that prior context never changes, its KV-Cache was computed once and stored. DeepSeek's production traces show KV-Cache hit rates of 95% or higher, meaning the dominant cost shifts from GPU computation to storage I/O: loading gigabytes of persistent key-value state from external NVMe-backed distributed storage, layer by layer, into prefill engines via RDMA. Hal presses on that 95% figure specifically, establishing that it is grounded in real production traffic rather than idealized assumptions — a distinction that determines whether storage bandwidth or GPU compute is the correct optimization target.
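The arithmetic behind this shift can be made concrete with a rough sketch. The layer count, head geometry, and NIC speed below are illustrative assumptions, not figures from the paper; the point is only that a high hit rate turns prefill cost into gigabytes of storage reads per request:

```python
# Back-of-envelope sketch of why a ~95% KV-Cache hit rate makes agentic
# prefill I/O-bound. Model shape and NIC speed are hypothetical values
# chosen for illustration, not taken from the DualPath paper.

def kv_cache_bytes(num_tokens, num_layers=61, kv_heads=8, head_dim=128,
                   bytes_per_elem=2):
    # Two tensors (K and V) per layer, per token, stored in fp16/bf16.
    return num_tokens * num_layers * 2 * kv_heads * head_dim * bytes_per_elem

def prefix_load_seconds(context_tokens, hit_rate, nic_gbps=400):
    # Time to pull the cached prefix from external NVMe-backed storage
    # over a single storage NIC of nic_gbps gigabits per second.
    cached_tokens = int(context_tokens * hit_rate)
    nic_bytes_per_s = nic_gbps / 8 * 1e9
    return kv_cache_bytes(cached_tokens) / nic_bytes_per_s

# A 100k-token agentic context at a 95% hit rate means loading roughly
# 24 GB of persisted K/V state before the short append is even prefilled.
cached_gb = kv_cache_bytes(95_000) / 1e9
load_s = prefix_load_seconds(100_000, 0.95)
```

Under these assumed numbers the GPU spends most of each "prefill" waiting on the storage path, which is exactly the regime where NIC bandwidth, not FLOPs, sets throughput.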
The episode frames DualPath's core insight against this background: the storage NICs on decode engines represent idle bandwidth that could absorb KV-Cache load traffic currently overwhelming prefill-side storage interfaces. By routing that traffic through decode-side hardware and transferring it to prefill engines over RDMA, DualPath breaks the single-path bottleneck without adding new hardware. The hosts connect this to the broader memory wall argument — that as context lengths grow and agentic sessions deepen, the architectural shift toward disaggregated inference is not optional, and the constraints driving system design are increasingly about data movement rather than floating-point throughput. DualPath's reported throughput improvement of up to 1.96x is presented as evidence that exploiting idle hardware asymmetries, rather than scaling compute, is where near-term agentic inference gains will be found.
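The mechanism the hosts describe can be sketched as a simple bandwidth-splitting calculation. The 400 Gbps NIC figures are assumptions for illustration: if a KV-Cache load is split across the prefill-side and decode-side storage NICs in proportion to their bandwidth, symmetric hardware gives an ideal 2x speedup, which brackets the paper's reported 1.96x:

```python
# Hypothetical sketch of the dual-path split: route part of a KV-Cache
# load through the decode engine's otherwise-idle storage NIC (forwarded
# to the prefill engine via RDMA) so that both paths finish together.
# NIC bandwidths below are illustrative assumptions.

def split_load(total_bytes, prefill_bw_gbps, decode_bw_gbps):
    # Proportional split: each path carries traffic matching its bandwidth,
    # so neither path becomes the straggler.
    agg = prefill_bw_gbps + decode_bw_gbps
    direct = total_bytes * prefill_bw_gbps / agg
    return direct, total_bytes - direct

def load_seconds(total_bytes, prefill_bw_gbps, decode_bw_gbps=0.0):
    # With a proportional split, effective bandwidth is the simple sum.
    agg_bytes_per_s = (prefill_bw_gbps + decode_bw_gbps) / 8 * 1e9
    return total_bytes / agg_bytes_per_s

single_path = load_seconds(24e9, 400)      # prefill storage NIC alone
dual_path = load_seconds(24e9, 400, 400)   # decode NIC absorbs half
speedup = single_path / dual_path          # ideal bound: 2.0x
```

The gap between this ideal 2.0x bound and the reported 1.96x would be accounted for by real-world overheads such as the extra RDMA hop from decode to prefill, which the sketch deliberately ignores.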
Sources:
1. DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference — Yongtong Wu, Shaoyuan Chen, Yinmin Zhong, Rilin Huang, Yixuan Tan, Wentao Zhang, Liyue Zhang, Shangyan Zhou, Yuxuan Liu, Shunfeng Zhou, Mingxing Zhang, Xin Jin, Panpan Huang, 2026
http://arxiv.org/abs/2602.21548
2. Efficient Memory Management for Large Language Model Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, Ion Stoica, 2023
https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention
3. SGLang: Efficient Execution of Structured Language Model Programs — Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Jeff Huang, Chuyue Sun, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph Gonzalez, Clark Barrett, Ying Sheng, 2024
https://scholar.google.com/scholar?q=SGLang:+Efficient+Execution+of+Structured+Language+Model+Programs
4. Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving — Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu, 2024
https://scholar.google.com/scholar?q=Mooncake:+A+KVCache-centric+Disaggregated+Architecture+for+LLM+Serving
5. CacheGen: Fast Context Loading for Language Model Applications — Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, Junchen Jiang, 2024
https://scholar.google.com/scholar?q=CacheGen:+Fast+Context+Loading+for+Language+Model+Applications
6. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving — Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang, 2024
https://scholar.google.com/scholar?q=DistServe:+Disaggregating+Prefill+and+Decoding+for+Goodput-optimized+Large+Language+Model+Serving
7. Splitwise: Efficient Generative LLM Inference Using Phase Splitting — Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, Ricardo Bianchini, 2024
https://scholar.google.com/scholar?q=Splitwise:+Efficient+Generative+LLM+Inference+Using+Phase+Splitting
8. Sarathi-Serve: Chunked Prefills for Efficient LLM Inference — Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, Ramachandran Ramjee, 2024
https://scholar.google.com/scholar?q=Sarathi-Serve:+Chunked+Prefills+for+Efficient+LLM+Inference
9. Llumnix: Dynamic Scheduling for Large Language Model Serving — Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, Wei Lin, 2024
https://scholar.google.com/scholar?q=Llumnix:+Dynamic+Scheduling+for+Large+Language+Model+Serving
10. FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU — Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Re, Ion Stoica, Ce Zhang, 2023
https://scholar.google.com/scholar?q=FlexGen:+High-Throughput+Generative+Inference+of+Large+Language+Models+with+a+Single+GPU
11. InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management — Wonbeom Lee, Jungi Lee, Junghwan Seo, Jaewoong Sim, 2024
https://scholar.google.com/scholar?q=InfiniGen:+Efficient+Generative+Inference+of+Large+Language+Models+with+Dynamic+KV+Cache+Management
12. Parrot: Efficient Servicing of LLM-based Applications with Semantic Variable — Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, Lili Qiu, 2024
https://scholar.google.com/scholar?q=Parrot:+Efficient+Servicing+of+LLM-based+Applications+with+Semantic+Variable
13. AgentBench: Evaluating LLMs as Agents — Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhiyuan Liu, Yuxiao Dong, Jie Tang, 2023
https://scholar.google.com/scholar?q=AgentBench:+Evaluating+LLMs+as+Agents
14. MemGPT: Towards LLMs as Operating Systems — Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, Joseph Gonzalez, 2023
https://scholar.google.com/scholar?q=MemGPT:+Towards+LLMs+as+Operating+Systems
15. Sarathi-Serve: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills — Amey Agrawal, Nikhil Mohan, Sudhanshu Panwar, et al., 2024
https://scholar.google.com/scholar?q=Sarathi-Serve:+Efficient+LLM+Inference+by+Piggybacking+Decodes+with+Chunked+Prefills
16. CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion — Jiayi Yao, Hanchen Li, Yuhan Liu, et al., 2024
https://scholar.google.com/scholar?q=CacheBlend:+Fast+Large+Language+Model+Serving+for+RAG+with+Cached+Knowledge+Fusion
17. CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving — Liu et al., 2024
https://scholar.google.com/scholar?q=CacheGen:+KV+Cache+Compression+and+Streaming+for+Fast+Large+Language+Model+Serving
18. Compute or Load KV Cache? Why Not Both? — Unknown (snippet only), 2024-2025
https://scholar.google.com/scholar?q=Compute+or+Load+KV+Cache?+Why+Not+Both?
19. Semi-PD: Towards Efficient LLM Serving via Phase-Wise Disaggregated Computation and Unified Storage — Unknown (snippet only), 2024-2025
https://scholar.google.com/scholar?q=Semi-PD:+Towards+Efficient+LLM+Serving+via+Phase-Wise+Disaggregated+Computation+and+Unified+Storage
20. Prefill-Decode Aggregation or Disaggregation? Unifying Both for Goodput-Optimized LLM Serving (TaiChi) — Unknown (snippet only), 2024-2025
https://scholar.google.com/scholar?q=Prefill-Decode+Aggregation+or+Disaggregation?+Unifying+Both+for+Goodput-Optimized+LLM+Serving+(TaiChi)
21. Continuum: Efficient and Robust Multi-Turn LLM Agent Scheduling with KV Cache Time-to-Live — Unknown (snippet only), 2024-2025
https://scholar.google.com/scholar?q=Continuum:+Efficient+and+Robust+Multi-Turn+LLM+Agent+Scheduling+with+KV+Cache+Time-to-Live
22. SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning — Unknown (snippet only), 2024-2025
https://scholar.google.com/scholar?q=SideQuest:+Model-Driven+KV+Cache+Management+for+Long-Horizon+Agentic+Reasoning
23. RDMA Point-to-Point Communication for LLM Systems — Unknown (snippet only), 2024-2025
https://scholar.google.com/scholar?q=RDMA+Point-to-Point+Communication+for+LLM+Systems
24. AI Post Transformers: FAST26: Bidaw: Enhancing Key-Value Caching for Interactive LLM Serving via Bidirectional — Hal Turing & Dr. Ada Shannon
https://podcasters.spotify.com/pod/show/12146088098/episodes/FAST26-Bidaw-Enhancing-Key-Value-Caching-for-Interactive-LLM-Serving-via-Bidirectional-e3fjgkh
25. AI Post Transformers: SYMPHONY: Memory Management for LLM Multi-Turn Inference — Hal Turing & Dr. Ada Shannon
https://podcasters.spotify.com/pod/show/12146088098/episodes/SYMPHONY-Memory-Management-for-LLM-Multi-Turn-Inference-e3ap0pf
26. AI Post Transformers: CXL-SpecKV: Bridging the LLM Memory Wall with Speculative FPGA Disaggregation — Hal Turing & Dr. Ada Shannon
https://podcasters.spotify.com/pod/show/12146088098/episodes/CXL-SpecKV-Bridging-the-LLM-Memory-Wall-with-Speculative-FPGA-Disaggregation-e3foad0
27. AI Post Transformers: AI and the Memory Wall: Overcoming Bottlenecks — Hal Turing & Dr. Ada Shannon
https://podcasters.spotify.com/pod/show/12146088098/episodes/AI-and-the-Memory-Wall-Overcoming-Bottlenecks-e36ki0u
28. AI Post Transformers: Tempo: SLO-Aware LLM Serving Maximizing Service Gain — Hal Turing & Dr. Ada Shannon
https://podcasters.spotify.com/pod/show/12146088098/episodes/Tempo-SLO-Aware-LLM-Serving-Maximizing-Service-Gain-e3ap12a
Interactive Visualization: DualPath Breaks Storage Bandwidth Bottleneck in Agentic Inference