This episode explores a systems paper that argues AI infrastructure should treat computation, interconnect bandwidth, and memory as a single joint design space rather than three separate bottlenecks. It explains the paper’s “AI Trinity” framework and walks through the main trade-offs: using extra compute to reduce communication, using networked or disaggregated memory to ease local memory limits, and using caching or stored intermediates to avoid recomputation. The discussion connects that framing to real AI practice, from distributed training bottlenecked by all-reduce bandwidth to inference constrained by KV-cache memory, while grounding it in broader ideas like scaling laws, the “Bitter Lesson,” FlashAttention’s IO-aware design, and the roofline model. A listener would find it interesting because it translates familiar pains—GPU memory ceilings, gradient traffic, and hardware inefficiency—into a clearer systems-level way of thinking about how modern AI actually scales.
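As a companion to the discussion, here is a minimal Python sketch (not from the paper; every hardware and model number below is an illustrative assumption) of two back-of-the-envelope calculations behind the episode's framing: the roofline model's bandwidth-versus-compute crossover, and the KV-cache memory footprint that constrains LLM inference.

```python
# Minimal sketch, assuming illustrative (not measured) hardware and model numbers.

def roofline_attainable_flops(peak_flops: float, mem_bandwidth: float,
                              arithmetic_intensity: float) -> float:
    """Attainable throughput = min(peak compute, bandwidth * FLOPs-per-byte)."""
    return min(peak_flops, mem_bandwidth * arithmetic_intensity)

def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: int = 2) -> int:
    """Keys + values stored per layer, head, and token (fp16 by default)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * bytes_per_elem

if __name__ == "__main__":
    # Hypothetical accelerator: 300 TFLOP/s peak compute, 2 TB/s memory bandwidth.
    peak, bw = 300e12, 2e12
    ridge = peak / bw  # FLOPs per byte a kernel needs to become compute-bound
    print(f"ridge point: {ridge:.0f} FLOPs/byte")
    for ai in (1, 10, 150, 1000):  # arithmetic intensity of different kernels
        tflops = roofline_attainable_flops(peak, bw, ai) / 1e12
        print(f"AI={ai:>5} FLOPs/byte -> {tflops:.1f} TFLOP/s attainable")

    # Hypothetical 70B-class decoder config, for scale only.
    gib = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128,
                         seq_len=32_768, batch=1) / 2**30
    print(f"KV cache at 32k tokens: {gib:.1f} GiB")
```

The ridge point (here 150 FLOPs/byte) is where a kernel stops being bandwidth-bound, which is the same IO-aware reasoning FlashAttention applies to attention, and the KV-cache estimate shows why long-context inference quickly becomes a memory problem rather than a compute one.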
Sources:
1. Computation-Bandwidth-Memory Trade-offs: A Unified Paradigm for AI Infrastructure — Yuankai Fan, Qizhen Weng, Xuelong Li, 2025
http://arxiv.org/abs/2601.11577
2. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models — Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, Yuxiong He, 2020
https://scholar.google.com/scholar?q=ZeRO:+Memory+Optimizations+Toward+Training+Trillion+Parameter+Models
3. Training Deep Nets with Sublinear Memory Cost — Tianqi Chen, Bing Xu, Chiyuan Zhang, Carlos Guestrin, 2016
https://scholar.google.com/scholar?q=Training+Deep+Nets+with+Sublinear+Memory+Cost
4. vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design — Minsoo Rhu, Natalia Gimelshein, Jason Clemons, Arslan Zulfiqar, Stephen W. Keckler, 2016
https://scholar.google.com/scholar?q=vDNN:+Virtualized+Deep+Neural+Networks+for+Scalable,+Memory-Efficient+Neural+Network+Design
5. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism — Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro, 2019
https://scholar.google.com/scholar?q=Megatron-LM:+Training+Multi-Billion+Parameter+Language+Models+Using+Model+Parallelism
6. Reducing Activation Recomputation in Large Transformer Models — Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, Bryan Catanzaro, 2022
https://scholar.google.com/scholar?q=Reducing+Activation+Recomputation+in+Large+Transformer+Models
7. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness — Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré, 2022
https://scholar.google.com/scholar?q=FlashAttention:+Fast+and+Memory-Efficient+Exact+Attention+with+IO-Awareness
8. Split Computing for Mobile Deep Inference: Survey and Research Directions — Yiping Kang, Johan Hauswald, et al. / related survey literature, 2020
https://scholar.google.com/scholar?q=Split+Computing+for+Mobile+Deep+Inference:+Survey+and+Research+Directions
9. Learning-Based Video Compression — Oren Rippel, Lubomir Bourdev, 2017
https://scholar.google.com/scholar?q=Learning-Based+Video+Compression
10. Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training — Yujun Lin, Song Han, Huizi Mao, Yu Wang, William J. Dally, 2018
https://scholar.google.com/scholar?q=Deep+Gradient+Compression:+Reducing+the+Communication+Bandwidth+for+Distributed+Training
11. Accelerating Diffusion Models with Cache-Based or Feature Reuse Methods (e.g., DeepCache / related 2024 diffusion caching work) — Various, 2024
https://scholar.google.com/scholar?q=Accelerating+Diffusion+Models+with+Cache-Based+or+Feature+Reuse+Methods+(e.g.,+DeepCache+/+related+2024+diffusion+caching+work)
12. RazorAttention: Efficient KV Cache Compression through Retrieval Heads — approx. recent LLM systems/serving authors, 2024/2025
https://scholar.google.com/scholar?q=RazorAttention:+Efficient+KV+Cache+Compression+through+Retrieval+Heads
13. StreamKV: Streaming Video Question-Answering with Segment-Based KV Cache Retrieval and Compression — approx. recent multimodal/LLM authors, 2024/2025
https://scholar.google.com/scholar?q=StreamKV:+Streaming+Video+Question-Answering+with+Segment-Based+KV+Cache+Retrieval+and+Compression
14. Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning — approx. recent LLM inference authors, 2024/2025
https://scholar.google.com/scholar?q=Not+All+Heads+Matter:+A+Head-Level+KV+Cache+Compression+Method+with+Integrated+Retrieval+and+Reasoning
15. In-Network Aggregation with Transport Transparency for Distributed Training — approx. systems/networking authors, 2023/2024
https://scholar.google.com/scholar?q=In-Network+Aggregation+with+Transport+Transparency+for+Distributed+Training
16. GRID: Gradient Routing with In-Network Aggregation for Distributed Training — approx. systems/networking authors, 2024/2025
https://scholar.google.com/scholar?q=GRID:+Gradient+Routing+with+In-Network+Aggregation+for+Distributed+Training
17. InArt: In-Network Aggregation with Route Selection for Accelerating Distributed Training — approx. systems/networking authors, 2024/2025
https://scholar.google.com/scholar?q=InArt:+In-Network+Aggregation+with+Route+Selection+for+Accelerating+Distributed+Training
18. PrivyNAS: Privacy-Aware Neural Architecture Search for Split Computing in Edge-Cloud Systems — approx. edge AI / NAS authors, 2024/2025
https://scholar.google.com/scholar?q=PrivyNAS:+Privacy-Aware+Neural+Architecture+Search+for+Split+Computing+in+Edge-Cloud+Systems
19. Advancements and Challenges in Privacy-Preserving Split Learning: Experimental Findings and Future Directions — approx. survey/review authors, 2024/2025
https://scholar.google.com/scholar?q=Advancements+and+Challenges+in+Privacy-Preserving+Split+Learning:+Experimental+Findings+and+Future+Directions
20. Lightweight User-Personalization Method for Closed Split Computing — approx. split-computing authors, 2024/2025
https://scholar.google.com/scholar?q=Lightweight+User-Personalization+Method+for+Closed+Split+Computing
21. AI Post Transformers: FlatAttention for Tile-Based Accelerator Inference — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-04-flatattention-for-tile-based-accelerator-56e6ca.mp3
22. AI Post Transformers: CXL Computational Memory Offloading for Lower Runtime — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-04-cxl-computational-memory-offloading-for-3b2124.mp3
23. AI Post Transformers: FAST26: Bidaw: Enhancing Key-Value Caching for Interactive LLM Serving via Bidirectional — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/fast26-bidaw-enhancing-key-value-caching-for-interactive-llm-serving-via-bidirec/
24. AI Post Transformers: LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-25-lookaheadkv-fast-and-accurate-kv-9cfc9f.mp3
25. AI Post Transformers: Memory Sparse Attention for 100M-Token Scaling — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-07-memory-sparse-attention-for-100m-token-s-377cff.mp3
26. AI Post Transformers: Accelerating LLM Cold Starts with Programmable Page Cache — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-17-accelerating-llm-cold-starts-with-progra-0912d1.mp3
27. AI Post Transformers: Paris: Decentralized Open-Weight Diffusion Model — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/paris-decentralized-open-weight-diffusion-model/
Interactive Visualization: Computation-Bandwidth-Memory Trade-offs for AI Infrastructure