
This episode addresses the real bottleneck after you build an LLM: fitting it into hardware that can actually run it. We explore why GPU memory is the scarce resource, how weights, KV cache, and activations compete for that space, and what that means in practice when prompts get long or concurrency spikes. We compare data center GPUs (high-bandwidth HBM) with local machines like the Mac Studio (huge unified memory, but slower bandwidth) to show the core tradeoff between capacity and speed. By the end, you will understand how to choose hardware based on your goal, and why the next lever is quantization, which shrinks models enough to fit, with a closing reflection on perspective when something big feels like it will not fit.
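The memory accounting the episode describes can be sketched as a back-of-envelope calculation. The model shapes below are illustrative assumptions (a Llama-2-7B-like configuration), not figures from the episode:

```python
# Rough GPU memory estimate for LLM inference: weights + KV cache.
# Activations add more on top; this sketch covers the two dominant terms.

GIB = 1024 ** 3

def weight_bytes(n_params: float, bytes_per_param: float) -> float:
    """Memory for the model weights alone."""
    return n_params * bytes_per_param

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: float = 2.0) -> float:
    """K and V tensors cached per layer for every token in flight.
    Factor of 2 covers both K and V."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Assumed 7B config: 32 layers, 32 KV heads, head_dim 128, fp16 (2 bytes/elem).
w = weight_bytes(7e9, 2)                                   # fp16 weights
kv = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=8)    # long prompts, 8 users
print(f"weights: {w / GIB:.1f} GiB, KV cache: {kv / GIB:.1f} GiB")
```

Note how the KV cache scales linearly with both sequence length and batch size, which is why long prompts or a concurrency spike can crowd out memory even when the weights fit. Quantization attacks the first term: 4-bit weights cut `bytes_per_param` from 2 to roughly 0.5, shrinking the weight footprint about fourfold.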
By Sheetal ’Shay’ Dhar