February 27, 2026

LLMs Can Learn to Reason Via Off-Policy RL

20 minutes

This research introduces Optimal Advantage-based Policy Optimization with Lagged Inference policy (OAPL), a reinforcement learning algorithm designed to improve the reasoning of Large Language Models (LLMs). Traditional methods like GRPO often struggle with "off-policy" data, which arises from technical mismatches between the training and inference engines. OAPL embraces these discrepancies by combining a squared regression objective with KL regularization, which lets the model learn effectively even when the data is significantly outdated. Empirical tests show that OAPL outperforms existing methods on competition mathematics benchmarks and matches top-tier coding models while using a third of the training samples. Furthermore, the algorithm prevents entropy collapse, so the model retains diverse, scalable problem-solving capabilities at test time. Ultimately, the authors show that fully asynchronous, off-policy training is a more stable and efficient path for advancing machine reasoning.
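For listeners who want a concrete picture of a squared regression objective with KL regularization, here is a minimal sketch. It is not taken from the paper: it assumes the standard KL-regularized RL identity, in which the optimal policy satisfies beta * log(pi*/pi_lagged) = r - V*, with V* = beta * log E[exp(r / beta)] estimated from a group of responses sampled by the lagged inference policy. The function names, the per-prompt grouping, and `beta` are all illustrative assumptions.

```python
import math

def soft_value(rewards, beta):
    """Estimate V* = beta * log(mean(exp(r / beta))) using a stable log-sum-exp.

    Assumption: the rewards belong to responses sampled from the lagged
    inference policy for a single prompt.
    """
    scaled = [r / beta for r in rewards]
    m = max(scaled)
    mean_exp = sum(math.exp(s - m) for s in scaled) / len(scaled)
    return beta * (m + math.log(mean_exp))

def squared_advantage_loss(logp_new, logp_lagged, rewards, beta):
    """Mean squared gap between the scaled log-ratio and the estimated advantage.

    logp_new:    sequence log-probabilities under the policy being trained
    logp_lagged: log-probabilities under the lagged policy that produced the data
    Hypothetical form: (beta * (logp_new - logp_lagged) - (r - V*))^2, averaged.
    """
    v_star = soft_value(rewards, beta)
    total = 0.0
    for lp, lq, r in zip(logp_new, logp_lagged, rewards):
        advantage = r - v_star          # estimated optimal advantage
        prediction = beta * (lp - lq)   # scaled log-ratio to the lagged policy
        total += (prediction - advantage) ** 2
    return total / len(rewards)

# Usage: two sampled responses, one incorrect (reward 0) and one correct (reward 1).
rewards = [0.0, 1.0]
logp_lagged = [-2.0, -3.0]
loss_before = squared_advantage_loss(logp_lagged, logp_lagged, rewards, beta=1.0)
```

Under these assumptions the loss needs no importance-sampling ratio or clipping: the lagged policy's log-probabilities enter only as a regression baseline, which is one way an objective of this kind can tolerate stale data.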

By Enoch H. Kang
