

This research explores ways to make Reinforcement Learning from Human Feedback (RLHF) more sample-efficient by leveraging imperfect reward models. The authors identify a key property of the KL-regularized RLHF objective: a policy's coverage of the optimal policy is linked to its sub-optimality, so a higher policy value indicates better coverage. Building on this insight, they propose a transfer learning approach and a theoretically sound algorithm, Transfer Policy Optimization (TPO), which selects which policy to transfer from based on policy value and incorporates "self-transfer learning" from data collected during the online process. They also develop a more practical empirical variant of TPO that uses win rates for policy selection to reduce computational cost, and they demonstrate its effectiveness on summarization tasks.
By Enoch H. Kang
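
To make the win-rate-based selection step concrete, here is a minimal Python sketch (not the authors' code) of choosing a transfer policy by comparing each candidate's empirical win rate against the current policy; all names (judge_prefers, select_transfer_policy, the 0.5 tie threshold) are hypothetical stand-ins for the paper's actual procedure.

```python
from typing import Callable, List

# A policy maps a prompt to a response (hypothetical type alias for this sketch).
Policy = Callable[[str], str]

def estimate_win_rate(policy_a: Policy,
                      policy_b: Policy,
                      prompts: List[str],
                      judge_prefers: Callable[[str, str, str], bool]) -> float:
    """Fraction of prompts on which the judge prefers policy_a's response over policy_b's."""
    wins = sum(1 for x in prompts if judge_prefers(x, policy_a(x), policy_b(x)))
    return wins / max(len(prompts), 1)

def select_transfer_policy(candidates: List[Policy],
                           current_policy: Policy,
                           prompts: List[str],
                           judge_prefers: Callable[[str, str, str], bool]) -> Policy:
    """Keep the current policy unless some candidate beats it by win rate,
    a cheap proxy for the policy-value comparison in the theoretical algorithm."""
    best, best_rate = current_policy, 0.5  # 0.5 = a policy ties against itself
    for cand in candidates:
        rate = estimate_win_rate(cand, current_policy, prompts, judge_prefers)
        if rate > best_rate:
            best, best_rate = cand, rate
    return best
```

The design choice this illustrates is the one described in the episode: win rates can be estimated from pairwise comparisons on a small prompt set, which is cheaper than estimating full policy values while preserving the "pick the stronger policy to transfer from" logic.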