Hal Turing and Dr. Ada Shannon examine FengHuang: Next-Generation Memory Orchestration for AI Inferencing, a 2025 Microsoft Research vision paper that asks a blunt systems question: should LLM serving keep revolving around GPU-local HBM, or is it time to treat rack-scale remote memory as a first-class inference substrate? They unpack why inference increasingly looks memory-bound rather than purely compute-bound, from giant model weights to ever-expanding KV caches and the communication overhead of splitting models across devices. The discussion frames TAB, the Tensor Addressable Bridge, as an attempt to decouple usable memory capacity from individual GPUs so operators do not have to keep buying extra accelerators just to store tensors.
The episode gets specific about the proposed architecture: a disaggregated, tiered memory design where local HBM remains the fast “hot” tier, while a larger remote pool elsewhere in the rack holds colder or bulkier tensors. Hal and Ada walk through what memory disaggregation means in practical terms, why conventional model-parallel inference becomes structurally wasteful, and how TAB is supposed to let a rack behave more like a shared-memory machine for tensor access. They also dig into the paper’s execution model, especially active tensor paging and the tensor prefetcher, which tries to move tensors into the right tier before a miss forces the GPU to stall.
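To make the paging and prefetching idea concrete, here is a minimal sketch of how a two-tier tensor cache with a lookahead prefetcher might behave. It is our own illustration under assumed policies (LRU demotion to the remote tier, synchronous page-in on a miss), not FengHuang's or TAB's actual implementation, and all class and function names are hypothetical.

from collections import OrderedDict

class TieredTensorCache:
    """Toy model of active tensor paging across a hot (HBM) tier and a remote pool."""
    def __init__(self, hot_capacity_gb, fetch_fn):
        self.hot = OrderedDict()        # tensor_id -> size_gb, kept in LRU order
        self.hot_capacity_gb = hot_capacity_gb
        self.used_gb = 0.0
        self.fetch_fn = fetch_fn        # pulls a tensor from the remote rack-scale pool

    def _make_room(self, needed_gb):
        # Demote least-recently-used tensors back to the remote tier.
        while self.used_gb + needed_gb > self.hot_capacity_gb and self.hot:
            _, size_gb = self.hot.popitem(last=False)
            self.used_gb -= size_gb

    def _page_in(self, tensor_id, size_gb):
        self._make_room(size_gb)
        self.fetch_fn(tensor_id)        # transfer from the remote pool into HBM
        self.hot[tensor_id] = size_gb
        self.used_gb += size_gb

    def access(self, tensor_id, size_gb):
        # Demand access: a hit keeps the GPU busy, a miss stalls it on the page-in.
        if tensor_id in self.hot:
            self.hot.move_to_end(tensor_id)
            return "hit"
        self._page_in(tensor_id, size_gb)
        return "miss"

    def prefetch(self, upcoming):
        # Prefetcher: stage tensors for upcoming layers while the current layer
        # computes, so the later demand access becomes a hit instead of a stall.
        for tensor_id, size_gb in upcoming:
            if tensor_id not in self.hot:
                self._page_in(tensor_id, size_gb)

A serving loop would call prefetch() with the next layer's weights and KV blocks while the current layer runs, and access() at the point of use; the hard systems work is making that overlap actually hide the remote transfer.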
Throughout, the hosts keep the paper’s claims under pressure. Ada highlights that FengHuang is presented as a vision report with simulation-based validation rather than a production deployment, and both hosts scrutinize whether the promised latency and throughput gains can survive real-world data-movement costs. They push back on simplistic “compute no longer matters” narratives, arguing instead that the core issue is the economic and architectural mismatch of using GPU scale-out to solve memory problems. The result is a grounded conversation about whether TAB represents a credible path to cheaper, more scalable inference—or just another reminder that data movement remains the real tax collector of AI systems.
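For a sense of scale behind that data-movement worry, here is a quick back-of-envelope in the same spirit; every size and bandwidth number below is an illustrative assumption of ours, not a figure from the paper.

layer_weights_gb = 2.0        # roughly one layer of a ~70B-parameter model in FP16
hbm_bandwidth_gbs = 3000.0    # ballpark HBM3-class bandwidth local to a GPU
fabric_bandwidth_gbs = 50.0   # ballpark per-GPU rack-scale fabric bandwidth

hbm_read_ms = layer_weights_gb / hbm_bandwidth_gbs * 1e3        # ~0.7 ms
remote_fetch_ms = layer_weights_gb / fabric_bandwidth_gbs * 1e3  # ~40 ms

print(f"HBM read:     {hbm_read_ms:.2f} ms")
print(f"Remote fetch: {remote_fetch_ms:.2f} ms")
# The roughly 60x gap is why the prefetcher matters: any remote fetch that is not
# hidden behind compute lands on the critical path and dominates per-layer latency.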
Sources:
1. FengHuang: Next-Generation Memory Orchestration for AI Inferencing — Jiamin Li, Lei Qu, Tao Zhang, Grigory Chirkov, Shuotao Xu, Peng Cheng, Lidong Zhou, 2025
http://arxiv.org/abs/2511.10753
2. GPUDirect Storage — NVIDIA, 2024
https://scholar.google.com/scholar?q=GPUDirect+Storage
3. AMD Instinct GPU architecture and platform materials — AMD, 2024
https://scholar.google.com/scholar?q=AMD+Instinct+GPU+architecture+and+platform+materials
4. Google TPU system architecture materials — Google, 2024
https://scholar.google.com/scholar?q=Google+TPU+system+architecture+materials
5. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism — Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro, 2019
https://scholar.google.com/scholar?q=Megatron-LM:+Training+Multi-Billion+Parameter+Language+Models+Using+Model+Parallelism
6. vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, et al., 2023
https://scholar.google.com/scholar?q=vLLM:+Easy,+Fast,+and+Cheap+LLM+Serving+with+PagedAttention
7. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness — Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré, 2022
https://scholar.google.com/scholar?q=FlashAttention:+Fast+and+Memory-Efficient+Exact+Attention+with+IO-Awareness
8. ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning — Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, et al., 2021
https://scholar.google.com/scholar?q=ZeRO-Infinity:+Breaking+the+GPU+Memory+Wall+for+Extreme+Scale+Deep+Learning
9. Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving — Qin et al., 2024
https://scholar.google.com/scholar?q=Mooncake:+A+KVCache-centric+Disaggregated+Architecture+for+LLM+Serving
10. PyramidInfer: Pyramid KV Cache Compression for High-Throughput LLM Inference — Yang et al., 2024
https://scholar.google.com/scholar?q=PyramidInfer:+Pyramid+KV+Cache+Compression+for+High-Throughput+LLM+Inference
11. Inference-Time Hyper-Scaling with KV Cache Compression, 2024/2025
https://scholar.google.com/scholar?q=Inference-Time+Hyper-Scaling+with+KV+Cache+Compression
12. Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache — Zhang et al., 2024
https://scholar.google.com/scholar?q=Q-Hitter:+A+Better+Token+Oracle+for+Efficient+LLM+Inference+via+Sparse-Quantized+KV+Cache
13. SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression — Dettmers et al., 2023
https://scholar.google.com/scholar?q=SPQR:+A+Sparse-Quantized+Representation+for+Near-Lossless+LLM+Weight+Compression
14. Enabling Dynamic Sparsity in Quantized LLM Inference, 2024/2025
https://scholar.google.com/scholar?q=Enabling+Dynamic+Sparsity+in+Quantized+LLM+Inference
15. Iso: Overlap of Computation and Communication Within Sequence for LLM Inference, 2025
https://scholar.google.com/scholar?q=Iso:+Overlap+of+Computation+and+Communication+Within+Sequence+for+LLM+Inference
16. Throughput Maximization for Transformer Inference on Processing Near-Memory Architectures, 2024/2025
https://scholar.google.com/scholar?q=Throughput+Maximization+for+Transformer+Inference+on+Processing+Near-Memory+Architectures
17. Improving Computation and Memory Efficiency for Real-World Transformer Inference on GPUs, 2024/2025
https://scholar.google.com/scholar?q=Improving+Computation+and+Memory+Efficiency+for+Real-World+Transformer+Inference+on+GPUs
18. Memory Is All You Need: An Overview of Compute-in-Memory Architectures for Accelerating Large Language Model Inference, 2024/2025
https://scholar.google.com/scholar?q=Memory+Is+All+You+Need:+An+Overview+of+Compute-in-Memory+Architectures+for+Accelerating+Large+Language+Model+Inference
19. AI Post Transformers: Computation-Bandwidth-Memory Trade-offs for AI Infrastructure — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-09-computation-bandwidth-memory-trade-offs-a83f2b.mp3
20. AI Post Transformers: CXL-SpecKV: Bridging the LLM Memory Wall with Speculative FPGA Disaggregation — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/cxl-speckv-bridging-the-llm-memory-wall-with-speculative-fpga-disaggregation/
21. AI Post Transformers: CXL Computational Memory Offloading for Lower Runtime — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-04-cxl-computational-memory-offloading-for-3b2124.mp3
22. AI Post Transformers: Splitwise: Phase-Split LLM Inference — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-26-splitwise-phase-split-llm-inference-e8945b.mp3
23. AI Post Transformers: Speculative Decoding in Real vLLM Serving — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-04-speculative-decoding-in-real-vllm-servin-6f4e2b.mp3
24. AI Post Transformers: AI and the Memory Wall: Overcoming Bottlenecks — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/ai-and-the-memory-wall-overcoming-bottlenecks/
25. AI Post Transformers: FlexGen: High-Throughput LLM Inference on a Single GPU — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/flexgen-high-throughput-llm-inference-on-a-single-gpu/
26. AI Post Transformers: LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-25-lookaheadkv-fast-and-accurate-kv-9cfc9f.mp3
27. AI Post Transformers: QuantSpec: Hierarchical KV Cache for Self-Speculative Decoding — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/quantspec-hierarchical-kv-cache-for-self-speculative-decoding/
28. AI Post Transformers: Batch-Aware Expert Routing for Faster MoE Decoding — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-04-batch-aware-expert-routing-for-faster-mo-683ab6.mp3
Interactive Visualization: FengHuang for Rack-Scale LLM Inference Memory