Hal Turing and Dr. Ada Shannon examine FengHuang: Next-Generation Memory Orchestration for AI Inferencing, a 2025 Microsoft Research vision paper that asks a blunt systems question: should LLM serving keep revolving around GPU-local HBM, or is it time to treat rack-scale remote memory as a first-class inference substrate? They unpack why inference increasingly looks memory-bound rather than purely compute-bound, from giant model weights to ever-expanding KV caches and the communication overhead of splitting models across devices. The discussion frames TAB, the Tensor Addressable Bridge, as an attempt to decouple usable memory capacity from individual GPUs so operators do not have to keep buying extra accelerators just to store tensors.
The episode gets specific about the proposed architecture: a disaggregated, tiered memory design where local HBM remains the fast “hot” tier, while a larger remote pool elsewhere in the rack holds colder or bulkier tensors. Hal and Ada walk through what memory disaggregation means in practical terms, why conventional model-parallel inference becomes structurally wasteful, and how TAB is supposed to let a rack behave more like a shared-memory machine for tensor access. They also dig into the paper’s execution model, especially active tensor paging and the tensor prefetcher, which tries to move tensors into the right tier before a miss forces the GPU to stall.
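To make the paging and prefetching idea concrete, here is a minimal sketch of how a two-tier tensor cache with a lookahead prefetcher might behave. It is our own illustration under assumed policies (LRU demotion to the remote tier, synchronous page-in on a miss), not FengHuang's or TAB's actual implementation, and all class and function names are hypothetical.

from collections import OrderedDict

class TieredTensorCache:
    """Toy model of active tensor paging across a hot (HBM) tier and a remote pool."""
    def __init__(self, hot_capacity_gb, fetch_fn):
        self.hot = OrderedDict()        # tensor_id -> size_gb, kept in LRU order
        self.hot_capacity_gb = hot_capacity_gb
        self.used_gb = 0.0
        self.fetch_fn = fetch_fn        # pulls a tensor from the remote rack-scale pool

    def _make_room(self, needed_gb):
        # Demote least-recently-used tensors back to the remote tier.
        while self.used_gb + needed_gb > self.hot_capacity_gb and self.hot:
            _, size_gb = self.hot.popitem(last=False)
            self.used_gb -= size_gb

    def _page_in(self, tensor_id, size_gb):
        self._make_room(size_gb)
        self.fetch_fn(tensor_id)        # transfer from the remote pool into HBM
        self.hot[tensor_id] = size_gb
        self.used_gb += size_gb

    def access(self, tensor_id, size_gb):
        # Demand access: a hit keeps the GPU busy, a miss stalls it on the page-in.
        if tensor_id in self.hot:
            self.hot.move_to_end(tensor_id)
            return "hit"
        self._page_in(tensor_id, size_gb)
        return "miss"

    def prefetch(self, upcoming):
        # Prefetcher: stage tensors for upcoming layers while the current layer
        # computes, so the later demand access becomes a hit instead of a stall.
        for tensor_id, size_gb in upcoming:
            if tensor_id not in self.hot:
                self._page_in(tensor_id, size_gb)

A serving loop would call prefetch() with the next layer's weights and KV blocks while the current layer runs, and access() at the point of use; the hard systems work is making that overlap actually hide the remote transfer.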
Throughout, the hosts keep the paper’s claims under pressure. Ada highlights that FengHuang is presented as a vision report with simulation-based validation rather than a production deployment, and both hosts scrutinize whether the promised latency and throughput gains can survive real-world data-movement costs. They push back on simplistic “compute no longer matters” narratives, arguing instead that the core issue is the economic and architectural mismatch of using GPU scale-out to solve memory problems. The result is a grounded conversation about whether TAB represents a credible path to cheaper, more scalable inference—or just another reminder that data movement remains the real tax collector of AI systems.
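For a sense of scale behind that data-movement worry, here is a quick back-of-envelope in the same spirit; every size and bandwidth number below is an illustrative assumption of ours, not a figure from the paper.

layer_weights_gb = 2.0        # roughly one layer of a ~70B-parameter model in FP16
hbm_bandwidth_gbs = 3000.0    # ballpark HBM3-class bandwidth local to a GPU
fabric_bandwidth_gbs = 50.0   # ballpark per-GPU rack-scale fabric bandwidth

hbm_read_ms = layer_weights_gb / hbm_bandwidth_gbs * 1e3        # ~0.7 ms
remote_fetch_ms = layer_weights_gb / fabric_bandwidth_gbs * 1e3  # ~40 ms

print(f"HBM read:     {hbm_read_ms:.2f} ms")
print(f"Remote fetch: {remote_fetch_ms:.2f} ms")
# The roughly 60x gap is why the prefetcher matters: any remote fetch that is not
# hidden behind compute lands on the critical path and dominates per-layer latency.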
Sources:
1. FengHuang: Next-Generation Memory Orchestration for AI Inferencing — Jiamin Li, Lei Qu, Tao Zhang, Grigory Chirkov, Shuotao Xu, Peng Cheng, Lidong Zhou, 2025
http://arxiv.org/abs/2511.10753
2. GPUDirect Storage — NVIDIA, 2024
https://scholar.google.com/scholar?q=GPUDirect+Storage
3. AMD Instinct GPU architecture and platform materials — AMD, 2024
https://scholar.google.com/scholar?q=AMD+Instinct+GPU+architecture+and+platform+materials
4. Google TPU system architecture materials — Google, 2024
https://scholar.google.com/scholar?q=Google+TPU+system+architecture+materials
5. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism — Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro, 2019
https://scholar.google.com/scholar?q=Megatron-LM:+Training+Multi-Billion+Parameter+Language+Models+Using+Model+Parallelism
6. vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, et al., 2023
https://scholar.google.com/scholar?q=vLLM:+Easy,+Fast,+and+Cheap+LLM+Serving+with+PagedAttention
7. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness — Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré, 2022
https://scholar.google.com/scholar?q=FlashAttention:+Fast+and+Memory-Efficient+Exact+Attention+with+IO-Awareness
8. ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning — Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, et al., 2021
https://scholar.google.com/scholar?q=ZeRO-Infinity:+Breaking+the+GPU+Memory+Wall+for+Extreme+Scale+Deep+Learning
9. Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving — Qin et al., 2024
https://scholar.google.com/scholar?q=Mooncake:+A+KVCache-centric+Disaggregated+Architecture+for+LLM+Serving
10. PyramidInfer: Pyramid KV Cache Compression for High-Throughput LLM Inference — Yang et al., 2024
https://scholar.google.com/scholar?q=PyramidInfer:+Pyramid+KV+Cache+Compression+for+High-Throughput+LLM+Inference
11. Inference-Time Hyper-Scaling with KV Cache Compression, 2024/2025
https://scholar.google.com/scholar?q=Inference-Time+Hyper-Scaling+with+KV+Cache+Compression
12. Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache — Zhang et al., 2024
https://scholar.google.com/scholar?q=Q-Hitter:+A+Better+Token+Oracle+for+Efficient+LLM+Inference+via+Sparse-Quantized+KV+Cache
13. SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression — Dettmers et al., 2023
https://scholar.google.com/scholar?q=SPQR:+A+Sparse-Quantized+Representation+for+Near-Lossless+LLM+Weight+Compression
14. Enabling Dynamic Sparsity in Quantized LLM Inference, 2024/2025
https://scholar.google.com/scholar?q=Enabling+Dynamic+Sparsity+in+Quantized+LLM+Inference
15. Iso: Overlap of Computation and Communication Within Sequence for LLM Inference, 2025
https://scholar.google.com/scholar?q=Iso:+Overlap+of+Computation+and+Communication+Within+Sequence+for+LLM+Inference
16. Throughput Maximization for Transformer Inference on Processing Near-Memory Architectures, 2024/2025
https://scholar.google.com/scholar?q=Throughput+Maximization+for+Transformer+Inference+on+Processing+Near-Memory+Architectures
17. Improving Computation and Memory Efficiency for Real-World Transformer Inference on GPUs, 2024/2025
https://scholar.google.com/scholar?q=Improving+Computation+and+Memory+Efficiency+for+Real-World+Transformer+Inference+on+GPUs
18. Memory Is All You Need: An Overview of Compute-in-Memory Architectures for Accelerating Large Language Model Inference, 2024/2025
https://scholar.google.com/scholar?q=Memory+Is+All+You+Need:+An+Overview+of+Compute-in-Memory+Architectures+for+Accelerating+Large+Language+Model+Inference
19. AI Post Transformers: Computation-Bandwidth-Memory Trade-offs for AI Infrastructure — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-09-computation-bandwidth-memory-trade-offs-a83f2b.mp3
20. AI Post Transformers: CXL-SpecKV: Bridging the LLM Memory Wall with Speculative FPGA Disaggregation — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/cxl-speckv-bridging-the-llm-memory-wall-with-speculative-fpga-disaggregation/
21. AI Post Transformers: CXL Computational Memory Offloading for Lower Runtime — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-04-cxl-computational-memory-offloading-for-3b2124.mp3
22. AI Post Transformers: Splitwise: Phase-Split LLM Inference — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-26-splitwise-phase-split-llm-inference-e8945b.mp3
23. AI Post Transformers: Speculative Decoding in Real vLLM Serving — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-04-speculative-decoding-in-real-vllm-servin-6f4e2b.mp3
24. AI Post Transformers: AI and the Memory Wall: Overcoming Bottlenecks — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/ai-and-the-memory-wall-overcoming-bottlenecks/
25. AI Post Transformers: FlexGen: High-Throughput LLM Inference on a Single GPU — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/flexgen-high-throughput-llm-inference-on-a-single-gpu/
26. AI Post Transformers: LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-25-lookaheadkv-fast-and-accurate-kv-9cfc9f.mp3
27. AI Post Transformers: QuantSpec: Hierarchical KV Cache for Self-Speculative Decoding — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/quantspec-hierarchical-kv-cache-for-self-speculative-decoding/
28. AI Post Transformers: Batch-Aware Expert Routing for Faster MoE Decoding — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-04-batch-aware-expert-routing-for-faster-mo-683ab6.mp3
Interactive Visualization: FengHuang for Rack-Scale LLM Inference Memory