Best AI papers explained

The Coverage Principle: How Pre-Training Enables Post-Training



This paper gives a theoretical analysis of next-token prediction in language models, introducing the coverage profile ($\text{Cov}_N$) as a better predictor than cross-entropy of downstream performance under Best-of-N (BoN) sampling. The authors establish a "coverage principle": maximum likelihood, i.e. next-token prediction, implicitly optimizes the coverage profile, and the resulting generalization guarantees avoid the spurious dependence on sequence length that afflicts cross-entropy/KL divergence. The paper shows that a good coverage profile is both necessary and sufficient for BoN success, derives scaling laws relating cross-entropy to coverage, and analyzes optimization methods such as stochastic gradient descent (SGD) and gradient normalization that provably improve coverage bounds. Finally, it proposes tournament-style estimators for selecting the model with the best coverage, particularly when the true data distribution is unknown.
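For concreteness, here is a minimal Python sketch of the two objects the summary refers to: Best-of-N selection with an external reward, and a Monte Carlo, coverage-style estimate based on the likelihood ratio between the true distribution and the model. The function names and the exact form of $\text{Cov}_N$ are illustrative assumptions, not the paper's definitions.

```python
import math

def best_of_n(sample_fn, reward_fn, prompt, n=16):
    """Best-of-N sampling: draw n completions and keep the highest-reward one.

    `sample_fn` and `reward_fn` are hypothetical interfaces standing in for a
    language model sampler and a downstream reward/verifier.
    """
    candidates = [sample_fn(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: reward_fn(prompt, y))

def coverage_profile_mc(samples_from_p_star, logp_model, logp_star, n):
    """Monte Carlo estimate of a coverage-style quantity (illustrative only):
    the fraction of sequences y ~ p* whose likelihood ratio p*(y) / pi(y)
    is at most n, i.e. sequences the model covers well enough that BoN with
    roughly n samples has a reasonable chance of producing them.
    """
    covered = sum(
        1 for y in samples_from_p_star
        if math.exp(logp_star(y) - logp_model(y)) <= n
    )
    return covered / len(samples_from_p_star)
```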


Best AI papers explained, by Enoch H. Kang