The AI Concepts Podcast

Module 4: Quantization - Shrinking Models Without Breaking Them


Listen Later

This episode tackles the lever that turns powerful LLMs into something you can actually run: quantization. We explore what it means to store model weights with fewer bits, why that can cut memory in half at 8-bit and down to roughly a quarter at 4-bit, and the real tradeoff between compression and capability as rounding error accumulates across billions of parameters. We break down why large models survive this better than small ones, why 8-bit is often near lossless, why 4-bit can still be shockingly strong, and why going below that can make models fall apart. We compare the three practical paths you will see in the wild: GPTQ (layer-wise compression with error compensation), AWQ (protecting the most important weights), and GGUF (the local-friendly format that makes CPU and GPU splitting possible).

...more
View all episodesView all episodes
Download on the App Store

The AI Concepts PodcastBy Sheetal ’Shay’ Dhar