00:00:00 Start
00:17:41 Evaluating the mathematical ability of DeepSeekMath-Base 7B
00:27:47 Training and evaluation of DeepSeekMath-RL
00:40:24 Exploring algorithms robust to noisy reward signals
00:46:35 Objectives and gradients of DPO, PPO, and GRPO
00:51:30 Closing

DeepSeekMath
