May 05, 2025

[Episode 217] Open-Reasoner-Zero: An Open-Source Method for Improving Reasoning Ability

21 minutes

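Seventy3: paper walkthroughs powered by NotebookLM, focused on artificial intelligence, large models, and robotics algorithms, so that everyone can keep improving alongside AI.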

To join the group, add the assistant on WeChat: seventy3_podcast

Note to include in your request: 小宇宙 (Xiaoyuzhou)

Today's topic: Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

Summary

Open-Reasoner-Zero (ORZ) is an open-source project focused on large-scale reinforcement learning for reasoning in large language models. The authors show that a simple recipe, vanilla PPO with a basic reward function, is enough to scale up reasoning ability, and that it outperforms a prior method (DeepSeek-R1-Zero) on a benchmark while using significantly fewer training steps. To keep the work accessible, the authors release the code, data, and model weights. The key findings are that minimalist RL designs are effective and that scaling the training data matters.
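The summary does not spell out what the "basic reward function" looks like. As a rough illustration only, here is a minimal sketch that assumes a binary reward: 1.0 when the final answer in the response exactly matches a reference answer, 0.0 otherwise. The function names, the `\boxed{...}` answer convention, and the exact-match rule are all assumptions made for this example, not ORZ's actual implementation.

```python
import re
from typing import Optional


def extract_final_answer(response: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in the response, or None.

    Assumption for this sketch: the model marks its final answer with
    \\boxed{...} and the answer contains no nested braces.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None


def basic_reward(response: str, reference: str) -> float:
    """Hypothetical binary reward: 1.0 on exact match, 0.0 otherwise."""
    answer = extract_final_answer(response)
    if answer is None:
        # No parseable final answer is treated the same as a wrong answer.
        return 0.0
    return 1.0 if answer == reference.strip() else 0.0
```

In a PPO loop, a scalar like this would typically be assigned once per sampled response. The appeal of such a reward is that it needs no learned reward model, only a reference answer for each prompt.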

Paper link: https://arxiv.org/abs/2503.24290
