February 27, 2025

[Episode 150] DeepSeek-R1

15 minutes

Seventy3: Turning papers into podcasts with NotebookLM, so everyone can keep learning alongside AI.

Today's topic: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Summary

DeepSeek-AI introduces DeepSeek-R1-Zero and DeepSeek-R1, two reasoning-focused large language models. DeepSeek-R1-Zero is trained with reinforcement learning (RL) alone, with no supervised fine-tuning (SFT) as a preliminary step, and still develops strong reasoning capabilities. DeepSeek-R1 builds on this by adding multi-stage training and "cold-start" data before RL, reaching performance comparable to OpenAI-o1-1217 on reasoning tasks. The company releases DeepSeek-R1-Zero, DeepSeek-R1, and a set of smaller distilled models to support the research community. Experiments show that DeepSeek-R1 excels at reasoning tasks and outperforms other models on certain benchmarks, and that distilling from DeepSeek-R1 substantially improves the reasoning abilities of smaller models. The study examines the benefits of RL and distillation, and also discusses approaches that did not succeed, such as Process Reward Models and Monte Carlo Tree Search.

Paper link: https://arxiv.org/abs/2501.12948
