AI Post Transformers

A Chain of Thought reasoning academic brawl



We review a late-2025 heavyweight academic brawl over the future of AI reasoning when Reinforcement Learning with Verifiable Rewards (RLVR) is used to train Chain of Thought (CoT) models. Two papers are on the table, and they are in direct conflict.

In one corner, a study by Yue et al. from Tsinghua University dropped a bombshell claim: RLVR, the technique behind major models like DeepSeek-R1, does not actually make models smarter. According to their research, RLVR acts more like a filter than a teacher; it improves the model's efficiency at finding correct answers it already knew how to find, but it fails to expand the model's reasoning capabilities. In fact, they claim that as training progresses, the model's reasoning boundary actually narrows.

In the other corner, a rebuttal from Wen et al. at Microsoft Research came out swinging. They explicitly cite Yue's paper, labeling the Tsinghua team's hypothesis "adventurous" and challenging its methodology. Wen et al. argue that Yue's team missed the forest for the trees by relying on the wrong metric: because models can sometimes guess the right answer with the wrong math, the standard evaluation (Pass@K) is unreliable. By introducing a new metric that checks the reasoning steps themselves (CoT-Pass@K), Wen et al. insist that RLVR does fundamentally extend the boundary of intelligence and that the skeptics were looking at the data all wrong.
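The core distinction Wen et al. draw can be sketched roughly as follows. This is toy code, not the papers' implementation: `answer_ok` and `reasoning_ok` are hypothetical stand-ins for an answer checker and a CoT verifier, and the sample records are invented for illustration.

```python
def pass_at_k(samples, answer_ok):
    # Pass@K: credit if ANY of the K samples reaches the right final
    # answer, regardless of whether the reasoning steps were sound.
    return any(answer_ok(s) for s in samples)

def cot_pass_at_k(samples, answer_ok, reasoning_ok):
    # CoT-Pass@K: credit only if some sample has BOTH a correct final
    # answer and a verified chain of thought.
    return any(answer_ok(s) and reasoning_ok(s) for s in samples)

# Hypothetical K=2 samples for one problem whose true answer is 42:
samples = [
    {"answer": 42, "steps_valid": False},  # lucky guess, flawed reasoning
    {"answer": 41, "steps_valid": True},   # sound steps, wrong arithmetic
]
answer_ok = lambda s: s["answer"] == 42
reasoning_ok = lambda s: s["steps_valid"]

print(pass_at_k(samples, answer_ok))                    # True: the guess counts
print(cot_pass_at_k(samples, answer_ok, reasoning_ok))  # False: no sample has both
```

On this toy problem the two metrics disagree, which is exactly Wen et al.'s point: Pass@K can reward a lucky guess that CoT-Pass@K correctly rejects.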
It is a classic scientific standoff: one side says the technology is an efficiency hack; the other says it's a fundamental leap forward.

Sources:

Paper 1 (November 25, 2025)
Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, Gao Huang
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
https://arxiv.org/pdf/2504.13837

Paper 2 (June 2025)
Xumeng Wen, Zihan Liu, Shun Zheng, Shengyu Ye, Zhirong Wu, Yang Wang, Zhijian Xu, Xiao Liang, Junjie Li, Ziming Miao, Jiang Bian, Mao Yang
Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs
https://arxiv.org/pdf/2506.14245

By mcgrof