Best AI papers explained

Coverage Improvement and Fast Convergence of On-policy Preference Learning


This paper provides a theoretical and empirical analysis of **on-policy preference learning**, a family of methods for aligning large language models with human preferences. The authors introduce the **coverage improvement principle**: updating a model on data it generates itself, rather than on static offline datasets, creates a feedback loop in which each round of data becomes increasingly informative. This allows **on-policy Direct Preference Optimization (DPO)** to achieve **exponentially faster convergence** and lower sample complexity than traditional offline approaches. Building on this, the researchers propose a **hybrid sampler** based on a novel **preferential G-optimal design** that guarantees convergence in only two training rounds. They also develop **reward distillation schemes** that exploit relative reward signals to achieve even faster learning rates than standard preference-based methods. Experimental results on **summarization and chat tasks** confirm that these on-policy techniques yield stable, monotonic performance gains while avoiding the degradation often seen in offline models.
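
To make the on-policy ingredient concrete, here is a minimal toy sketch (not the authors' implementation) of a round-based on-policy DPO loop. A softmax distribution over a small discrete response set stands in for the language model, `true_reward` is a hypothetical preference oracle, and the constants `K`, `beta`, and `lr` are illustrative. The key point is that each round's preference pairs are sampled from the *current* policy rather than from a fixed offline dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

K, beta, lr = 8, 0.5, 0.5           # toy response space size, DPO temperature, step size (assumptions)
true_reward = rng.normal(size=K)    # hypothetical hidden reward acting as the preference oracle
theta = np.zeros(K)                 # policy logits; policy = softmax(theta)
theta_ref = theta.copy()            # frozen reference policy logits

def log_softmax(x):
    z = x - x.max()
    return z - np.log(np.exp(z).sum())

for round_ in range(5):
    # On-policy step: freeze this round's sampling distribution at the CURRENT policy.
    probs = np.exp(log_softmax(theta))
    for _ in range(200):
        a, b = rng.choice(K, size=2, p=probs)      # draw a fresh on-policy response pair
        # Bradley-Terry preference label from the hidden reward.
        p_a_wins = 1.0 / (1.0 + np.exp(true_reward[b] - true_reward[a]))
        w, l = (a, b) if rng.random() < p_a_wins else (b, a)
        if w == l:
            continue
        # DPO margin: (log pi(w) - log pi_ref(w)) - (log pi(l) - log pi_ref(l)).
        logp, logp_ref = log_softmax(theta), log_softmax(theta_ref)
        margin = (logp[w] - logp_ref[w]) - (logp[l] - logp_ref[l])
        # Gradient of -log sigmoid(beta * margin) w.r.t. the winner/loser logits.
        g = beta / (1.0 + np.exp(beta * margin))   # = beta * sigmoid(-beta * margin)
        theta[w] += lr * g
        theta[l] -= lr * g
    exp_reward = np.exp(log_softmax(theta)) @ true_reward
    print(f"round {round_}: expected reward = {exp_reward:.3f}")
```

For a softmax policy, the per-pair DPO gradient with respect to the logits reduces to `-beta * sigmoid(-beta * margin) * (e_w - e_l)`, which is what the two logit updates implement. The offline variant would instead fix `probs` once from the reference policy for all rounds, which is exactly the setting where coverage stops improving.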

Best AI papers explained, by Enoch H. Kang