
Challenges and Research Directions for Large Language Model Inference Hardware addresses the growing hardware "crisis" in datacenter AI, driven by the escalating costs of serving state-of-the-art LLMs. Advancing trends like Mixture of Experts (MoE), reasoning models, and expanding context windows place severe strains on computing resources.
The authors identify that the autoregressive "Decode" phase of LLM inference is the primary bottleneck. Unlike the parallel "Prefill" phase, Decode generates one token at a time, making it fundamentally memory-bound. This creates major challenges with the "Memory Wall" (the slow scaling and high cost of High Bandwidth Memory, or HBM) and end-to-end interconnect latency.
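The memory-bound nature of Decode can be seen with a back-of-the-envelope roofline estimate. The sketch below is illustrative and not from the paper: it assumes a hypothetical 70B-parameter dense model in 16-bit weights and counts only weight traffic, ignoring KV-cache reads and activations.

```python
# Back-of-the-envelope roofline check: why Decode is memory-bound.
# Hypothetical dense model; all numbers are illustrative, not from the paper.

params = 70e9          # 70B parameters (assumed model size)
bytes_per_param = 2    # fp16/bf16 weights

# Decode: one token per forward pass, so every weight is read once
# but contributes only ~2 FLOPs (one multiply, one add).
decode_flops = 2 * params                        # FLOPs per generated token
decode_bytes = params * bytes_per_param          # weight traffic per token
decode_intensity = decode_flops / decode_bytes   # arithmetic intensity

# Prefill: a 2048-token prompt is processed in one batched pass,
# so each weight read is amortized across all prompt tokens.
prompt_len = 2048
prefill_intensity = (2 * params * prompt_len) / decode_bytes

print(decode_intensity)    # ~1 FLOP per byte of HBM traffic
print(prefill_intensity)   # ~2048 FLOPs per byte
```

A modern accelerator needs on the order of hundreds of FLOPs per byte of HBM traffic to stay compute-bound, so Decode at roughly 1 FLOP/byte is limited by memory bandwidth no matter how many FLOPS the chip offers, while Prefill's amortized weight reads keep it compute-bound.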
To address these inefficiencies, the paper proposes four promising hardware research directions.
Ultimately, the authors argue that the current AI hardware philosophy—which focuses on giant chips with maximum FLOPS—is a mismatch for LLM Decode inference. They advocate for a paradigm shift that prioritizes memory capacity, memory bandwidth, and network latency, evaluated through modern metrics like Total Cost of Ownership (TCO), power efficiency, and carbon footprint.
By Yun Wu