
This research paper investigates challenges in using process reward models (PRMs) for reinforcement fine-tuning (RFT) of large language models (LLMs) on reasoning tasks, specifically the reward hacking caused by traditional summation-based credit assignment. To mitigate this, the authors introduce PURE (Process sUpervised Reinforcement lEarning), a framework that uses min-form credit assignment, valuing a step by the minimum future reward, which leads to more stable and efficient training. Their experiments show that PRM-based RFT with PURE matches or exceeds the reasoning performance of methods trained on verifiable rewards, and that combining PRMs with a small amount of verifiable rewards further improves performance and reduces reward hacking. The paper also analyzes several cases of reward hacking and the causes of training collapse, offering insights for future research on PRM-based RFT.
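To make the contrast concrete, here is a minimal Python sketch (not the authors' code) of the two credit-assignment schemes the summary describes: summation-based returns, which can be inflated by padding a trace with many mildly positive steps, versus min-form returns, which are bounded by the worst remaining step. The PRM scores used below are hypothetical, purely for illustration.

```python
from typing import List


def sum_form_returns(step_rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Return at each step as the (discounted) sum of remaining rewards.

    Summation-based credit assignment: a step's value is the total of all
    rewards that follow it, which a policy can inflate by adding extra
    mildly positive steps (a reward-hacking pattern discussed in the paper).
    """
    returns = [0.0] * len(step_rewards)
    running = 0.0
    for t in reversed(range(len(step_rewards))):
        running = step_rewards[t] + gamma * running
        returns[t] = running
    return returns


def min_form_returns(step_rewards: List[float]) -> List[float]:
    """Return at each step as the minimum reward over the remaining steps.

    Min-form credit assignment: a step's value is capped by the worst future
    step, so appending low-quality steps cannot raise the return.
    """
    returns = [0.0] * len(step_rewards)
    running = float("inf")
    for t in reversed(range(len(step_rewards))):
        running = min(step_rewards[t], running)
        returns[t] = running
    return returns


if __name__ == "__main__":
    # Hypothetical PRM scores for a 5-step reasoning trace.
    prm_scores = [0.9, 0.8, 0.2, 0.7, 0.9]
    print("sum-form:", sum_form_returns(prm_scores))
    print("min-form:", min_form_returns(prm_scores))
```

Under the sum-form scheme every extra positive step raises earlier returns, whereas the min-form scheme keeps them pinned at the weakest future step (0.2 in this example), which is the intuition behind the more stable training the paper reports.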