Mechanical Dreams

Scaling Laws for Precision



In this episode:
• Introduction to Precision in Scaling Laws: Linda introduces a new paper that adds precision as a third variable, alongside parameters and data, to the Chinchilla scaling laws. Professor Norris reflects on how precision is usually treated as an afterthought.
• The Post-Training Quantization Paradox: The hosts discuss the surprising finding that training models on more and more data can make them degrade more once post-training quantization is applied.
• Effective Parameters and Low-Precision Training: Linda explains the concept of effective parameter count, and how lowering the precision of weights, activations, and the KV cache shrinks the model's effective size multiplicatively (see the sketch after this list).
• Finding the Compute-Optimal Precision: Professor Norris is surprised to learn that compute-optimal pretraining precision is around 7 to 8 bits, completely independent of the compute budget unless model size is constrained.
• A Unified Scaling Law and Takeaways: The episode wraps up by bringing pretraining and post-training precision into a single mathematical framework, discussing what this means for the future of model training.
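
For listeners who want a concrete picture, below is a minimal Python sketch of the kind of loss model the episode describes: a Chinchilla-style loss in which the parameter count is replaced by an "effective" count that shrinks with the precision of weights, activations, and KV cache, plus a penalty term for post-training quantization. The functional forms follow the discussion above, but every constant (A, B, E, alpha, beta, the gamma sensitivities, and the degradation coefficients) is an illustrative placeholder, not the paper's fitted value.

```python
import numpy as np

# All constants below are illustrative placeholders, not the paper's fitted values.
A, B, E = 400.0, 410.0, 1.7   # Chinchilla-style coefficients
alpha, beta = 0.34, 0.28      # parameter / data exponents
gamma = 2.0                   # precision sensitivity (shared here for simplicity)

def effective_params(n, p_weights, p_acts, p_kv):
    """Effective parameter count: lowering the precision of weights, activations,
    and the KV cache each shrinks the model's effective size, multiplicatively."""
    shrink = lambda p: 1.0 - np.exp(-p / gamma)
    return n * shrink(p_weights) * shrink(p_acts) * shrink(p_kv)

def ptq_degradation(n, d, p_post, c_t=0.05, g_d=0.5, g_n=0.5, g_p=1.0):
    """Extra loss from post-training quantization: grows with tokens seen (D),
    shrinks with model size (N) and post-training precision (exact form assumed)."""
    return c_t * (d ** g_d / n ** g_n) * np.exp(-p_post / g_p)

def unified_loss(n, d, p_train, p_post):
    """Single framework: Chinchilla loss on the effective parameter count,
    plus the post-training quantization penalty."""
    n_eff = effective_params(n, p_train, p_train, p_train)
    return A * n_eff ** -alpha + B * d ** -beta + E + ptq_degradation(n, d, p_post)

# At a fixed compute budget C ~ 6 * N * D * (P / 16), sweeping training precision
# trades tokens against effective parameters, so the loss is U-shaped in precision.
compute, n = 1e21, 1e9
for p in (3, 4, 6, 8, 12, 16):
    d = compute / (6 * n * (p / 16))  # tokens affordable at this precision
    print(f"{p:>2} bits -> loss {unified_loss(n, d, p, p):.3f}")
```

With these toy constants the sweep bottoms out in the mid-to-high single digits of bits, which mirrors the episode's "7 to 8 bits is compute-optimal" takeaway; the exact minimum depends entirely on the fitted constants.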

By Mechanical Dirk