DualPath: Breaking the Storage Wall
Episode Summary
A deep dive into DualPath, a system that tackles the storage bandwidth bottleneck in agentic LLM inference — then a scale-by-scale walkthrough of how the same bottleneck affects everyone from Raspberry Pi clusters to DGX SuperPods. As AI agents run multi-turn sessions (100+ turns, 95%+ KV-cache reuse), the bottleneck shifts from compute to storage I/O. DualPath exploits idle storage NICs on decode engines to load KV-cache and forward it via RDMA to prefill engines, achieving up to 1.87x offline throughput and 1.96x average online serving improvements. We break down the architecture, then walk from the Raspberry Pi 5 to the Mac mini to the DGX Spark to production scale, showing where the diagnosis applies universally and where the specific cure requires datacenter hardware.
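To make the core idea concrete, here is a minimal sketch (not the paper's implementation; all names and numbers are hypothetical) of the bandwidth-proportional split behind the dual-path approach: the prefill engine reads part of the KV-cache over its own storage NIC, while idle storage NICs on decode engines read the rest and relay it to the prefill engine via RDMA.

```python
# Illustrative sketch of dual-path KV-cache loading. Assumes the load is
# partitioned proportionally to each path's bandwidth so both finish
# together; the real system's scheduling is more involved.

def split_kv_load(cache_gb, prefill_snic_gbps, decode_snic_gbps, n_decode):
    """Return (direct_gb, relayed_gb, seconds) for loading cache_gb of
    KV-cache using the prefill engine's storage NIC plus n_decode idle
    decode-engine storage NICs relaying over RDMA."""
    direct_bw = prefill_snic_gbps / 8            # GB/s on the direct path
    relay_bw = n_decode * decode_snic_gbps / 8   # GB/s via idle decode SNICs
    total_bw = direct_bw + relay_bw
    direct = cache_gb * direct_bw / total_bw     # share read directly
    relayed = cache_gb - direct                  # share relayed by decoders
    return direct, relayed, cache_gb / total_bw

# Hypothetical example: one 400Gbps SNIC on the prefill engine,
# seven idle 400Gbps decode SNICs, 100GB of KV-cache to load.
direct, relayed, t = split_kv_load(100, 400, 400, 7)
# Aggregate bandwidth is 8x the single-NIC path, so the load finishes
# in roughly 1/8 of the direct-only time.
```

The point of the sketch is the aggregation: a single storage NIC caps the direct path, but the pool of idle decode-side NICs multiplies effective read bandwidth.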
Paper Discussed
DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference
arXiv:2602.21548 — HTML version
Authors: Yongtong Wu, Shaoyuan Chen, Yinmin Zhong, Rilin Huang, Yixuan Tan, Wentao Zhang, Liyue Zhang, Shangyan Zhou, Yuxuan Liu, Shunfeng Zhou, Mingxing Zhang, Xin Jin, Panpan Huang
Affiliations: Peking University, Tsinghua University, DeepSeek-AI

Hardware Scale Walkthrough
Raspberry Pi 5 Cluster
~30 TOPS NPU, Gigabit Ethernet, USB 3.0 storage
Same I/O-bottleneck physics, but no RDMA or traffic isolation available
Diagnosis applies; cure doesn't

Mac mini M4 / Mac Studio
16–32GB unified memory, Thunderbolt 4 (40Gbps bidirectional)
A single bus carries all traffic — no compute/storage network separation
Thunderbolt 5 at 120Gbps starts to change the equation

DGX Spark Cluster
8x Sparks: 128GB each, 1TB total, ConnectX-7 with real RDMA
Two MikroTik switches: one compute network, one storage network
4 prefill + 4 decode engines (1:1 P/D ratio — middle of the bottleneck-free range)
~$30K all-in (8 × $3K Sparks + ~$2,600 switches + cables)
DGX Spark home cluster build video — 6,367 tok/s on Qwen 34B BF16
This is where DualPath's architecture becomes directly feasible
QSFP28 vs QSFP56 cable differences matter for bandwidth

Production Scale (Paper's Target)
DGX SuperPOD: 8 GPUs/node, 8x 400Gbps CNICs, 1x 400Gbps SNIC
Physically isolated compute and storage networks
Full DualPath: up to 1.87x offline, 1.96x online throughput

Key Concepts
Prefill-Decode Disaggregation — Separating prompt processing from token generation onto dedicated engine pools. See DistServe.
KV-Cache — Cached attention keys and values, stored to avoid recomputation on subsequent turns.
Cache-Compute Ratio — GB of KV-cache to load per PFLOP of compute. The universal diagnostic for whether you're I/O-bound or compute-bound.
RDMA — Remote Direct Memory Access. Direct memory-to-memory transfer without CPU involvement.
Layerwise Prefill — Per-layer KV-cache loading to overcome HBM limits. See LayerKV.
3FS — DeepSeek's distributed file system. GitHub.
InfiniBand Virtual Lanes — Hardware QoS for traffic isolation.

Key Numbers
Metric                                              Value
Avg agent turns (production traces)                 157
Avg append tokens per turn                          429
KV-cache hit rate                                   98.7%
Cache-compute ratio (DeepSeek V3.2)                 13–36 GB/PFLOP
Cache-compute ratio (Qwen 32B, FP16)                117–267 GB/PFLOP
Offline throughput improvement                      up to 1.87x
Online serving throughput improvement               1.96x average
I/O-compute ratio degradation (Ampere→Blackwell)    14.4x
Bottleneck-free P/D ratio range                     1/7 to 7/2
Scale tested                                        up to 1,152 GPUs
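The cache-compute ratios above translate directly into a bandwidth requirement. A hedged back-of-envelope check (the sustained-PFLOPS figure is an illustrative assumption, not from the paper):

```python
# Back-of-envelope: storage read bandwidth needed to keep a prefill
# engine's compute fed, given a cache-compute ratio in GB/PFLOP.

def required_read_bw_gbps(gb_per_pflop, sustained_pflops):
    """GB of KV-cache per PFLOP times PFLOP/s delivered gives GB/s of
    reads needed to avoid stalling; multiply by 8 to convert to Gbps."""
    return gb_per_pflop * sustained_pflops * 8

# Hypothetical example: Qwen 32B at the low end of its ratio
# (117 GB/PFLOP) on a GPU sustaining 0.5 PFLOP/s.
bw = required_read_bw_gbps(117, 0.5)
# -> 468.0 Gbps, already more than a single 400Gbps storage NIC,
# which is why idle decode-side NICs become worth borrowing.
```

Under the same arithmetic, DeepSeek V3.2's much lower ratio (13–36 GB/PFLOP) fits comfortably within one SNIC at that compute rate, which matches the intuition that denser KV-caches hit the wall first.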
Related Work
Mooncake: KV Cache-Centric LLM Serving — DRAM-based caching approach
DistServe: Disaggregating Prefill and Decoding — PD disaggregation
DeepSeek-V3 Technical Report — Model architecture
FlashMLA — Efficient attention kernel
DeepEP — Expert parallel communication
SGLang — Structured LLM serving

Models Evaluated
DeepSeek V3.2 660B (MoE with sparse attention)
DS 27B (downscaled V3.2)
Qwen2.5-32B (dense, GQA)

Author Profiles
Yinmin Zhong — Peking University
Xin Jin — Peking University
Mingxing Zhang — Tsinghua University

This episode was generated with AI assistance.