Mechanical Dreams

Challenges and Research Directions for Large Language Model Inference Hardware


In this episode:
• Introduction: The Disconnect: Professor Norris and Linda introduce the paper 'Challenges and Research Directions for Large Language Model Inference Hardware' by Ma and Patterson, discussing the widening gap between academic architecture research and industry reality.
• The Inference Crisis: Prefill vs. Decode: The hosts break down why LLM inference is fundamentally different from training, explaining the 'Memory Wall' and the specific bottleneck of the autoregressive Decode phase.
• Solution 1: High Bandwidth Flash: Linda proposes High Bandwidth Flash (HBF) as a solution for capacity, while Professor Norris questions the latency and endurance issues inherent to flash memory.
• Solution 2 & 3: PNM and 3D Stacking: A discussion on Processing-Near-Memory (PNM) versus Processing-In-Memory (PIM), and how 3D stacking can shorten the distance between compute and data.
• Solution 4: Interconnects and New Metrics: The hosts discuss why latency matters more than bandwidth for inference interconnects, and conclude with a look at new evaluation metrics such as total cost of ownership (TCO) and carbon footprint.
Mechanical Dreams, by Mechanical Dirk