


Exploratory RL (ExpRL) is an automated mid-training method designed to enhance the reasoning capabilities of large language models before they undergo standard reinforcement learning. While traditional reinforcement learning often struggles with sparse rewards on difficult problems, ExpRL uses human-written reference solutions as reward scaffolds to provide dense, informative feedback on partial progress. This approach employs an LLM judge to evaluate on-policy reasoning traces against specific rubrics, assigning rewards at both the outcome and process levels to reinforce productive intermediate steps. By shifting probability mass toward successful solution strategies, the method significantly improves pass@k performance and broadens the model’s coverage of complex reasoning paths. Experimental results demonstrate that ExpRL creates a superior initialization for subsequent training, outperforming supervised fine-tuning and standard distillation across challenging math and science benchmarks. Ultimately, this technique fosters sophisticated behaviors like self-correction and backtracking, which are essential for solving high-level reasoning tasks.
By Enoch H. Kang
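
To make the reward idea in the description concrete, here is a minimal Python sketch, not the method's actual implementation. The function `scaffolded_reward`, the rubric score format, and the `process_weight` mixing weight are all hypothetical, since the description does not say how process-level and outcome-level rewards are combined. The `pass_at_k` function is the standard unbiased pass@k estimator computed from n samples with c correct, included only to show what the reported metric measures.

```python
from math import comb


def scaffolded_reward(rubric_scores, outcome_correct, process_weight=0.5):
    """Blend a process-level score with an outcome-level score.

    Hypothetical illustration only:
    - rubric_scores: judge scores in [0, 1], one per rubric item
      (assumed format, not taken from the source).
    - outcome_correct: True if the final answer matched the reference.
    - process_weight: assumed mixing weight in [0, 1].
    """
    if not 0.0 <= process_weight <= 1.0:
        raise ValueError("process_weight must be in [0, 1]")
    # Partial progress: average rubric score, or 0 if no items were scored.
    process = sum(rubric_scores) / len(rubric_scores) if rubric_scores else 0.0
    outcome = 1.0 if outcome_correct else 0.0
    return process_weight * process + (1.0 - process_weight) * outcome


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples drawn per problem, c: samples that were correct,
    k: attempt budget being evaluated (1 <= k <= n).
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any k draws include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Under these assumptions, a trace that satisfies half of the rubric items but ends with a wrong answer still earns a nonzero reward, which is the sense in which the feedback is dense rather than sparse.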