How Two Silent Library Bugs Quietly Invalidated a Wave of Reasoning Papers
Source: SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning
Paper was published on April 26, 2026
This episode was AI-generated on May 2, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.
An ETH Zurich group sat down to reproduce the hot new mixed-policy methods for training reasoning models — and found their plain SFT baseline beating the published baselines by five points. Pulling the thread led to two silent bugs in widely-used training libraries that had been deflating baselines across an entire subfield for over a year, and to an uncomfortable question about how much of recent benchmark progress is real.
Key Takeaways
- How a misplaced branch in DeepSpeed's CPU-offloading code silently discarded most micro-batch gradients during accumulation, shrinking effective training signal without any warning
- Why a 'mean-of-means' loss aggregation bug in OpenRLHF systematically mis-weights SFT updates when response lengths vary, and how it migrated from pretraining code where it was harmless
- A clean four-number staircase that attributes a five-point baseline gap almost entirely to the optimizer bug, with the loss bug contributing under a point
- Why corrected SFT-then-RL beats every published mixed-policy method on math benchmarks, by 3.8 points on Qwen and a striking 22 points on Llama, at roughly half the FLOPs
- The structural lesson: when a whole subfield's baselines flow through the same library, independent replication becomes illusory, and framework diversity functions as epistemic insurance
- Where the paper's claims have real edges: single-seed reproductions, math-only benchmarks, and the open question of whether mixed-policy methods could still help on top of a properly trained SFT model

00:00 - The reproduction that wouldn't reproduce
How an ETH group's two SFT baselines, run in different frameworks with identical settings, disagreed by five-and-a-half points and started the investigation.

02:52 - The DeepSpeed gradient accumulation bug
A misplaced conditional in CPU-offloading code meant only the first micro-batch's gradients were ever copied to the optimizer, silently, for over a year.

05:45 - Why only the baselines were sick
The asymmetry that made the bug invisible: mixed-policy methods ran on healthy verl/FSDP infrastructure while their SFT baselines ran through DeepSpeed.

08:38 - The mean-of-means loss bug
How OpenRLHF's distributed loss aggregation systematically mis-weights tokens when batch sizes vary, and how it leaked in from pretraining code where it was harmless.

11:30 - The four-number staircase
A controlled ablation that attributes the baseline gap to each bug individually and shows the patched pipeline matching an independently implemented clean baseline.

14:23 - Corrected baselines flip the field's conclusions
On Qwen and especially on Llama, a properly trained SFT-then-RL pipeline beats every published mixed-policy method, often dramatically and at lower compute cost.

17:16 - Where the paper's claims have edges
A steelman pass on dataset scope, single-seed reproductions, hyperparameter tuning, and what the paper does and does not rule out about mixed-policy methods.

20:08 - The structural lesson about shared infrastructure
Why concentrated tooling turns independent replications into a single point of failure, and what framework diversity buys a subfield epistemically.

Recommended Reading
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning - The paper that established the SFT-then-RL recipe this episode defends, and the baseline against which mixed-policy methods positioned themselves.
- LUFFY: Learning to Reason under Off-Policy Guidance - One of the mixed-policy methods whose published advantage over SFT-then-RL the episode argues was an artifact of buggy baselines.
- The Unreasonable Effectiveness of Eccentric Automatic Prompts - A different flavor of the same lesson: apparent model 'limitations' often turn out to be artifacts of the surrounding pipeline rather than the model itself.
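
Worked Example

For listeners who want the arithmetic behind the two bugs discussed in the episode, here is a rough toy sketch. Everything in it is hypothetical: the function names, the `drop_after_first` flag, and the numbers are made up for illustration, and none of it is code from DeepSpeed or OpenRLHF. The real bugs sit in distributed, CPU-offloaded training paths that this sketch ignores. It only shows the two effects in miniature: averaging per-micro-batch means over-weights tokens in short micro-batches, and an accumulation loop that only forwards the first micro-batch loses the rest of the gradient signal.

```python
# Toy illustration only. Hypothetical names and numbers; not library code.

def token_weighted_mean(micro_batches):
    """Average loss over all tokens, so every token counts equally."""
    total_loss = sum(sum(mb) for mb in micro_batches)
    total_tokens = sum(len(mb) for mb in micro_batches)
    return total_loss / total_tokens


def mean_of_means(micro_batches):
    """Average the per-micro-batch means; short micro-batches get extra weight per token."""
    per_batch = [sum(mb) / len(mb) for mb in micro_batches]
    return sum(per_batch) / len(per_batch)


def accumulate(micro_grads, drop_after_first=False):
    """Sum micro-batch gradients.

    With drop_after_first=True (an assumed stand-in for the misplaced branch),
    only the first micro-batch's gradient is kept.
    """
    total = 0.0
    for step, grad in enumerate(micro_grads):
        if drop_after_first and step > 0:
            continue  # computed, but never added to the accumulated gradient
        total += grad
    return total


# Per-token losses for a 2-token response and an 8-token response.
batches = [[4.0, 4.0], [1.0] * 8]
weighted = token_weighted_mean(batches)  # 16 loss units over 10 tokens
naive = mean_of_means(batches)           # average of 4.0 and 1.0

# Scalar stand-ins for four micro-batch gradients.
grads = [1.0, 2.0, 3.0, 4.0]
healthy = accumulate(grads)
buggy = accumulate(grads, drop_after_first=True)

print(f"token-weighted loss: {weighted}, mean-of-means loss: {naive}")
print(f"accumulated gradient: healthy={healthy}, buggy={buggy}")
```

When every micro-batch holds the same number of tokens, the two loss averages coincide, which is consistent with the episode's point that the aggregation was harmless in pretraining-style code and only starts to matter once response lengths vary.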