Learning GenAI via SOTA Papers

EP097: DeepSeek R1 Taught Itself to Reason



DeepSeek-R1 is a research initiative by DeepSeek-AI focused on significantly enhancing the reasoning capabilities of Large Language Models (LLMs) through Reinforcement Learning (RL). The paper details the development of two primary models and a series of smaller distilled models:

  • DeepSeek-R1-Zero: The researchers first explored bypassing traditional supervised fine-tuning (SFT) by training a model entirely via pure RL using Group Relative Policy Optimization (GRPO). By rewarding the model solely on the accuracy of its final answers, DeepSeek-R1-Zero autonomously developed advanced, emergent reasoning strategies, such as self-verification, reflection, and exploring alternative solutions. However, it suffered from poor readability and frequent language mixing.
  • DeepSeek-R1: To resolve R1-Zero's limitations and improve general performance, the team created DeepSeek-R1 using a multi-stage training pipeline. This pipeline incorporates a small amount of "cold-start" SFT data, RL for reasoning, rejection sampling, and further RL using human-preference reward models to align the model for helpfulness and harmlessness. DeepSeek-R1 achieves state-of-the-art performance across math, coding, and general benchmarks, rivaling top-tier closed-source models like OpenAI's o1.
  • Distilled Models: Finally, the researchers demonstrated that the advanced reasoning patterns of DeepSeek-R1 can be used to teach smaller, more efficient models. By fine-tuning open-source models (Qwen and Llama) on outputs generated by DeepSeek-R1, they distilled powerful reasoning capabilities into models ranging from 1.5B to 70B parameters, which substantially outperformed their non-reasoning counterparts.
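The core of the GRPO setup described above is that no learned value model (critic) is needed: for each prompt, a group of completions is sampled, scored with a rule-based accuracy reward, and each completion's advantage is its reward normalized against the group's mean and standard deviation. A minimal sketch of that group-relative advantage computation (omitting the clipped policy-ratio objective and KL penalty of the full algorithm):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages, GRPO-style (simplified sketch):
    normalize each sampled completion's reward by the mean and
    standard deviation of its group, so no critic model is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All completions scored the same: no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Example: 4 sampled answers to one prompt, binary accuracy reward
# (1.0 = final answer correct, 0.0 = incorrect).
advantages = grpo_advantages([1.0, 0.0, 1.0, 0.0])
print(advantages)  # correct answers get positive advantage: [1.0, -1.0, 1.0, -1.0]
```

Because the reward depends only on the final answer's correctness, the model is free to discover its own intermediate reasoning, which is how R1-Zero's self-verification and reflection behaviors emerged.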
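The distillation step above is plain supervised fine-tuning: the student model is trained on reasoning traces and answers generated by DeepSeek-R1. A hypothetical sketch of how one such training example might be packed (the function name and dict layout are illustrative, not from the paper; R1 does emit its chain of thought inside `<think>` tags):

```python
def make_sft_example(question, r1_trace, r1_answer):
    """Pack an R1-generated chain of thought and final answer into a
    single supervised fine-tuning target for a smaller student model.
    Illustrative format only, not the paper's exact schema."""
    return {
        "prompt": question,
        "completion": f"<think>{r1_trace}</think>\n{r1_answer}",
    }

ex = make_sft_example(
    "What is 7 * 6?",
    "7 * 6 = 7 * (5 + 1) = 35 + 7 = 42.",
    "42",
)
print(ex["completion"].endswith("42"))  # True
```

Training on such traces transfers the teacher's reasoning style without any RL on the student, which is why even the 1.5B distilled model benefits.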

Learning GenAI via SOTA Papers, by Yun Wu