June 21, 2026

ExpRL: Using Reference Solutions as Rewards for LLM Mid-Training

21 minutes

Exploratory RL (ExpRL) is an automated mid-training method designed to enhance the reasoning capabilities of large language models before they undergo standard reinforcement learning. While traditional reinforcement learning often struggles with sparse rewards on difficult problems, ExpRL uses human-written reference solutions as reward scaffolds to provide dense, informative feedback on partial progress. This approach employs an LLM judge to evaluate on-policy reasoning traces against specific rubrics, assigning rewards at both the outcome and process levels to reinforce productive intermediate steps. By shifting probability mass toward successful solution strategies, the method significantly improves pass@k performance and broadens the model’s coverage of complex reasoning paths. Experimental results demonstrate that ExpRL creates a superior initialization for subsequent training, outperforming supervised fine-tuning and standard distillation across challenging math and science benchmarks. Ultimately, this technique fosters sophisticated behaviors like self-correction and backtracking, which are essential for solving high-level reasoning tasks.
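Since the description leans on pass@k, a small worked example may help show what that metric measures: the chance that at least one of k sampled attempts solves a problem. The sketch below uses the unbiased estimator commonly used in code-generation evaluation, 1 - C(n-c, k) / C(n, k), where n attempts are sampled and c of them are correct. Whether the ExpRL evaluation uses exactly this estimator is an assumption, and the function name `pass_at_k` is illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled attempts, c of which are correct.

    Illustrative helper (not taken from the episode): returns the
    probability that a random subset of k of the n attempts contains
    at least one correct attempt.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) counts the size-k subsets with no correct attempt;
    # it is 0 whenever k exceeds the number of incorrect attempts.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical numbers: 16 attempts on one problem, 2 of them correct.
    for k in (1, 4, 8):
        print(f"pass@{k} = {pass_at_k(16, 2, k):.3f}")
```

Under this reading, a model that solves a hard problem only rarely still earns a pass@k well above its pass@1 at larger k, which is why broader coverage of solution strategies shows up in this metric.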

By Enoch H. Kang
