
This document introduces Prolonged Reinforcement Learning (ProRL), a new training method designed to significantly enhance the reasoning abilities of large language models. By combining KL divergence control with reference policy resetting, ProRL keeps training stable over extended runs, allowing models to discover novel reasoning strategies and outperform their base models across tasks spanning math, code, STEM, and logic puzzles. The research indicates that RL is particularly effective on tasks where the base model initially struggles, and that the sustained training gains reflect a genuine expansion of reasoning boundaries, even on unseen tasks. The work highlights the potential of long-horizon RL to produce more capable and generalizable AI systems, exemplified by the authors' Nemotron-Research-Reasoning-Qwen-1.5B model.
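For a concrete picture of the two stabilizers mentioned above, here is a minimal Python sketch: it assumes a policy-gradient surrogate with a KL penalty against a frozen reference model, plus a hard reset that re-anchors the reference when the policy drifts too far. The function names and the `kl_coef` / `kl_threshold` values are illustrative assumptions, not details taken from the paper.

```python
import torch

def kl_regularized_loss(policy_logprobs, ref_logprobs, advantages, kl_coef=0.1):
    """Policy-gradient surrogate loss plus a KL penalty toward a frozen reference.

    policy_logprobs / ref_logprobs: log-probabilities of the sampled tokens
    under the current policy and the reference model (same shape tensors).
    """
    # Monte Carlo estimate of KL(policy || reference) from tokens sampled
    # under the current policy.
    kl_estimate = (policy_logprobs - ref_logprobs).mean()
    # Standard REINFORCE-style term weighted by per-token advantages.
    pg_loss = -(advantages * policy_logprobs).mean()
    return pg_loss + kl_coef * kl_estimate, kl_estimate.detach()

@torch.no_grad()
def maybe_reset_reference(policy_model, ref_model, mean_kl, kl_threshold=1.0):
    """Reference policy reset: if the policy has drifted far from the reference,
    copy the current policy into the reference so the KL penalty does not
    dominate the loss and stall further learning."""
    if mean_kl > kl_threshold:
        ref_model.load_state_dict(policy_model.state_dict())
```

In a real training loop these would sit inside the RL update step; the sketch only isolates the stabilization idea described in the summary, not the paper's full recipe.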