This episode explores a systems paper that argues AI infrastructure should treat computation, interconnect bandwidth, and memory as a single joint design space rather than three separate bottlenecks. It explains the paper’s “AI Trinity” framework and walks through the main trade-offs: using extra compute to reduce communication, using networked or disaggregated memory to ease local memory limits, and using caching or stored intermediates to avoid recomputation. The discussion connects that framing to real AI practice, from distributed training bottlenecked by all-reduce bandwidth to inference constrained by KV-cache memory, while grounding it in broader ideas like scaling laws, the “Bitter Lesson,” FlashAttention’s IO-aware design, and the roofline model. A listener would find it interesting because it translates familiar pains—GPU memory ceilings, gradient traffic, and hardware inefficiency—into a clearer systems-level way of thinking about how modern AI actually scales.
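As a companion to the discussion, here is a minimal Python sketch (not from the paper; every hardware and model number below is an illustrative assumption) of two back-of-the-envelope calculations behind the episode's framing: the roofline model's bandwidth-versus-compute crossover, and the KV-cache memory footprint that constrains LLM inference.

```python
# Minimal sketch, assuming illustrative (not measured) hardware and model numbers.

def roofline_attainable_flops(peak_flops: float, mem_bandwidth: float,
                              arithmetic_intensity: float) -> float:
    """Attainable throughput = min(peak compute, bandwidth * FLOPs-per-byte)."""
    return min(peak_flops, mem_bandwidth * arithmetic_intensity)

def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: int = 2) -> int:
    """Keys + values stored per layer, head, and token (fp16 by default)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * bytes_per_elem

if __name__ == "__main__":
    # Hypothetical accelerator: 300 TFLOP/s peak compute, 2 TB/s memory bandwidth.
    peak, bw = 300e12, 2e12
    ridge = peak / bw  # FLOPs per byte a kernel needs to become compute-bound
    print(f"ridge point: {ridge:.0f} FLOPs/byte")
    for ai in (1, 10, 150, 1000):  # arithmetic intensity of different kernels
        tflops = roofline_attainable_flops(peak, bw, ai) / 1e12
        print(f"AI={ai:>5} FLOPs/byte -> {tflops:.1f} TFLOP/s attainable")

    # Hypothetical 70B-class decoder config, for scale only.
    gib = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128,
                         seq_len=32_768, batch=1) / 2**30
    print(f"KV cache at 32k tokens: {gib:.1f} GiB")
```

The ridge point (here 150 FLOPs/byte) is where a kernel stops being bandwidth-bound, which is the same IO-aware reasoning FlashAttention applies to attention, and the KV-cache estimate shows why long-context inference quickly becomes a memory problem rather than a compute one.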
Sources:
1. Computation-Bandwidth-Memory Trade-offs: A Unified Paradigm for AI Infrastructure — Yuankai Fan, Qizhen Weng, Xuelong Li, 2025
http://arxiv.org/abs/2601.11577
2. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models — Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, Yuxiong He, 2020
https://scholar.google.com/scholar?q=ZeRO:+Memory+Optimizations+Toward+Training+Trillion+Parameter+Models
3. Training Deep Nets with Sublinear Memory Cost — Tianqi Chen, Bing Xu, Chiyuan Zhang, Carlos Guestrin, 2016
https://scholar.google.com/scholar?q=Training+Deep+Nets+with+Sublinear+Memory+Cost
4. vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design — Minsoo Rhu, Natalia Gimelshein, Jason Clemons, Arslan Zulfiqar, Stephen W. Keckler, 2016
https://scholar.google.com/scholar?q=vDNN:+Virtualized+Deep+Neural+Networks+for+Scalable,+Memory-Efficient+Neural+Network+Design
5. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism — Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro, 2019
https://scholar.google.com/scholar?q=Megatron-LM:+Training+Multi-Billion+Parameter+Language+Models+Using+Model+Parallelism
6. Reducing Activation Recomputation in Large Transformer Models — Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, Bryan Catanzaro, 2022
https://scholar.google.com/scholar?q=Reducing+Activation+Recomputation+in+Large+Transformer+Models
7. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness — Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré, 2022
https://scholar.google.com/scholar?q=FlashAttention:+Fast+and+Memory-Efficient+Exact+Attention+with+IO-Awareness
8. Split Computing for Mobile Deep Inference: Survey and Research Directions — Yiping Kang, Johan Hauswald, et al. / related survey literature, 2020
https://scholar.google.com/scholar?q=Split+Computing+for+Mobile+Deep+Inference:+Survey+and+Research+Directions
9. Learning-Based Video Compression — Oren Rippel, Lubomir Bourdev, 2017
https://scholar.google.com/scholar?q=Learning-Based+Video+Compression
10. Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training — Yujun Lin, Song Han, Huizi Mao, Yu Wang, William J. Dally, 2018
https://scholar.google.com/scholar?q=Deep+Gradient+Compression:+Reducing+the+Communication+Bandwidth+for+Distributed+Training
11. Accelerating Diffusion Models with Cache-Based or Feature Reuse Methods (e.g., DeepCache / related 2024 diffusion caching work) — Various, 2024
https://scholar.google.com/scholar?q=Accelerating+Diffusion+Models+with+Cache-Based+or+Feature+Reuse+Methods+(e.g.,+DeepCache+/+related+2024+diffusion+caching+work)
12. RazorAttention: Efficient KV Cache Compression through Retrieval Heads — approx. recent LLM systems/serving authors, 2024/2025
https://scholar.google.com/scholar?q=RazorAttention:+Efficient+KV+Cache+Compression+through+Retrieval+Heads
13. StreamKV: Streaming Video Question-Answering with Segment-Based KV Cache Retrieval and Compression — approx. recent multimodal/LLM authors, 2024/2025
https://scholar.google.com/scholar?q=StreamKV:+Streaming+Video+Question-Answering+with+Segment-Based+KV+Cache+Retrieval+and+Compression
14. Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning — approx. recent LLM inference authors, 2024/2025
https://scholar.google.com/scholar?q=Not+All+Heads+Matter:+A+Head-Level+KV+Cache+Compression+Method+with+Integrated+Retrieval+and+Reasoning
15. In-Network Aggregation with Transport Transparency for Distributed Training — approx. systems/networking authors, 2023/2024
https://scholar.google.com/scholar?q=In-Network+Aggregation+with+Transport+Transparency+for+Distributed+Training
16. GRID: Gradient Routing with In-Network Aggregation for Distributed Training — approx. systems/networking authors, 2024/2025
https://scholar.google.com/scholar?q=GRID:+Gradient+Routing+with+In-Network+Aggregation+for+Distributed+Training
17. InArt: In-Network Aggregation with Route Selection for Accelerating Distributed Training — approx. systems/networking authors, 2024/2025
https://scholar.google.com/scholar?q=InArt:+In-Network+Aggregation+with+Route+Selection+for+Accelerating+Distributed+Training
18. PrivyNAS: Privacy-Aware Neural Architecture Search for Split Computing in Edge-Cloud Systems — approx. edge AI / NAS authors, 2024/2025
https://scholar.google.com/scholar?q=PrivyNAS:+Privacy-Aware+Neural+Architecture+Search+for+Split+Computing+in+Edge-Cloud+Systems
19. Advancements and Challenges in Privacy-Preserving Split Learning: Experimental Findings and Future Directions — approx. survey/review authors, 2024/2025
https://scholar.google.com/scholar?q=Advancements+and+Challenges+in+Privacy-Preserving+Split+Learning:+Experimental+Findings+and+Future+Directions
20. Lightweight User-Personalization Method for Closed Split Computing — approx. split-computing authors, 2024/2025
https://scholar.google.com/scholar?q=Lightweight+User-Personalization+Method+for+Closed+Split+Computing
21. AI Post Transformers: FlatAttention for Tile-Based Accelerator Inference — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-04-flatattention-for-tile-based-accelerator-56e6ca.mp3
22. AI Post Transformers: CXL Computational Memory Offloading for Lower Runtime — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-04-cxl-computational-memory-offloading-for-3b2124.mp3
23. AI Post Transformers: FAST26: Bidaw: Enhancing Key-Value Caching for Interactive LLM Serving via Bidirectional — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/fast26-bidaw-enhancing-key-value-caching-for-interactive-llm-serving-via-bidirec/
24. AI Post Transformers: LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-25-lookaheadkv-fast-and-accurate-kv-9cfc9f.mp3
25. AI Post Transformers: Memory Sparse Attention for 100M-Token Scaling — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-07-memory-sparse-attention-for-100m-token-s-377cff.mp3
26. AI Post Transformers: Accelerating LLM Cold Starts with Programmable Page Cache — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-17-accelerating-llm-cold-starts-with-progra-0912d1.mp3
27. AI Post Transformers: Paris: Decentralized Open-Weight Diffusion Model — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/paris-decentralized-open-weight-diffusion-model/
Interactive Visualization: Computation-Bandwidth-Memory Trade-offs for AI Infrastructure