February 25, 2026

Module 4: Quantization - Shrinking Models Without Breaking Them

11 minutes

This episode tackles the lever that turns powerful LLMs into something you can actually run: quantization. We explore what it means to store model weights with fewer bits, why that cuts memory in half at 8-bit and to roughly a quarter at 4-bit (relative to 16-bit weights), and the real tradeoff between compression and capability as rounding error accumulates across billions of parameters. We break down why large models survive this better than small ones, why 8-bit is often near-lossless, why 4-bit can still be shockingly strong, and why going below that can make models fall apart. We compare the three practical paths you will see in the wild: GPTQ (layer-wise compression with error compensation), AWQ (protecting the most important weights), and GGUF (the local-friendly format that makes CPU and GPU splitting possible).
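
To make the memory and rounding-error tradeoff concrete, here is a minimal sketch in plain Python. It is illustrative only: the 7-billion-parameter count, the 16-bit baseline, and the helper names `weight_memory_gb` and `quantize_roundtrip` are assumptions, not something from the episode. It uses simple symmetric round-to-nearest quantization with a single scale, which is not how GPTQ, AWQ, or GGUF work internally; real formats also store per-group scales, so actual files run somewhat larger than this arithmetic suggests.

```python
import random


def weight_memory_gb(n_params, bits):
    """Approximate storage for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits / 8 / 1e9


def quantize_roundtrip(weights, bits):
    """Quantize to signed integers with one shared scale, then dequantize.

    Assumption: symmetric round-to-nearest, a deliberately naive scheme.
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit, 1 for 2-bit
    scale = max(abs(w) for w in weights) / qmax
    restored = []
    for w in weights:
        q = max(-qmax, min(qmax, round(w / scale)))  # integer code
        restored.append(q * scale)                   # value the model would see
    return restored


def mean_abs_error(original, restored):
    """Average absolute rounding error per weight."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Hypothetical model size, chosen only for illustration.
N_PARAMS = 7e9
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB")

# Synthetic "weights": small Gaussian values standing in for a real layer.
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(10_000)]
errors = {}
for bits in (8, 4, 2):
    errors[bits] = mean_abs_error(weights, quantize_roundtrip(weights, bits))
    print(f"{bits}-bit mean abs rounding error: {errors[bits]:.6f}")
```

The memory figures scale linearly with bit width, while the rounding error grows much faster as bits are removed, which is the intuition behind 8-bit being close to lossless and sub-4-bit being risky.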

By Sheetal ‘Shay’ Dhar
