May 02, 2026

How Two Silent Library Bugs Quietly Invalidated a Wave of Reasoning Papers

23 minutes

Source: SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning

Paper was published on April 26, 2026

This episode was AI-generated on May 2, 2026. The script was written by an AI language model and the host voices were synthesized by ElevenLabs. The producer is not affiliated with Anthropic or ElevenLabs.

An ETH Zurich group sat down to reproduce the hot new mixed-policy methods for training reasoning models — and found their plain SFT baseline beating the published baselines by five points. Pulling the thread led to two silent bugs in widely-used training libraries that had been deflating baselines across an entire subfield for over a year, and to an uncomfortable question about how much of recent benchmark progress is real.

Key Takeaways

How a misplaced branch in DeepSpeed's CPU-offloading code silently discarded most micro-batch gradients during accumulation, shrinking effective training signal without any warning

Why a 'mean-of-means' loss aggregation bug in OpenRLHF systematically mis-weights SFT updates when response lengths vary, and how it migrated from pretraining code where it was harmless

A clean four-number staircase that attributes a five-point baseline gap almost entirely to the optimizer bug, with the loss bug contributing under a point

Why corrected SFT-then-RL beats every published mixed-policy method on math benchmarks — by 3.8 points on Qwen and a striking 22 points on Llama — at roughly half the FLOPs

The structural lesson: when a whole subfield's baselines flow through the same library, independent replication becomes illusory, and framework diversity functions as epistemic insurance

Where the paper's claims have real edges — single-seed reproductions, math-only benchmarks, and the open question of whether mixed-policy methods could still help on top of a properly trained SFT model
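For listeners who want to see the gradient-accumulation failure from the first takeaway in miniature, here is a toy sketch. It is not DeepSpeed's code: the function names, the single scalar parameter, and the gradient values are all made up for illustration, and the exact scaling in the real bug may differ. It only shows how a condition that fires on the first micro-batch alone lets the optimizer see a fraction of the accumulated signal without raising any error.

```python
# Toy illustration only. This is NOT DeepSpeed's implementation;
# every name and number below is hypothetical.

def accumulate_correct(grads):
    """Average every micro-batch gradient into the optimizer buffer."""
    buffer = 0.0
    for g in grads:
        buffer += g / len(grads)
    return buffer


def accumulate_buggy(grads):
    """Copy a gradient into the buffer only for the first micro-batch.

    Later micro-batches are still computed, but they never reach the
    optimizer, which mirrors the failure mode described in the episode.
    """
    buffer = 0.0
    for step, g in enumerate(grads):
        if step == 0:  # misplaced condition: should cover every step
            buffer += g / len(grads)
    return buffer


# Pretend gradients of one scalar parameter over four micro-batches.
grads = [0.4, -0.1, 0.3, 0.2]
correct = accumulate_correct(grads)
buggy = accumulate_buggy(grads)
print(f"correct: {correct:.3f}  buggy: {buggy:.3f}")
```

Both versions run to completion and both produce a plausible-looking update, which is why a bug of this shape can go unnoticed: nothing crashes, training simply learns from less data than the configuration says it should.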

00:00 — The reproduction that wouldn't reproduce
How an ETH group's two SFT baselines, run in different frameworks with identical settings, disagreed by five-and-a-half points and started the investigation.

02:52 — The DeepSpeed gradient accumulation bug
A misplaced conditional in CPU-offloading code meant only the first micro-batch's gradients were ever copied to the optimizer — silently, for over a year.

05:45 — Why only the baselines were sick
The asymmetry that made the bug invisible: mixed-policy methods ran on healthy verl/FSDP infrastructure while their SFT baselines ran through DeepSpeed.

08:38 — The mean-of-means loss bug
How OpenRLHF's distributed loss aggregation systematically mis-weights tokens when response lengths vary across micro-batches, and how it leaked in from pretraining code where it was harmless.

11:30 — The four-number staircase
A controlled ablation that attributes the baseline gap to each bug individually and shows the patched pipeline matching an independently implemented clean baseline.

14:23 — Corrected baselines flip the field's conclusions
On Qwen and especially on Llama, a properly trained SFT-then-RL pipeline beats every published mixed-policy method, often dramatically and at lower compute cost.

17:16 — Where the paper's claims have edges
A steelman pass on dataset scope, single-seed reproductions, hyperparameter tuning, and what the paper does and does not rule out about mixed-policy methods.

20:08 — The structural lesson about shared infrastructure
Why concentrated tooling turns independent replications into a single point of failure, and what framework diversity buys a subfield epistemically.
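The mean-of-means chapter is easier to follow with numbers. The sketch below is not OpenRLHF's code. It assumes the intended SFT loss is a single per-token mean over the whole batch, and it uses invented per-token losses for two responses of different lengths. The equal-length case at the end is our guess at why the same aggregation would be harmless in pretraining, where sequences are typically packed to a fixed length.

```python
# Hypothetical example. Names and loss values are illustrative only.

def token_mean(groups):
    """Average over all tokens in the batch (assumed intended behavior)."""
    total = sum(sum(g) for g in groups)
    count = sum(len(g) for g in groups)
    return total / count


def mean_of_means(groups):
    """Average each group first, then average the group means."""
    return sum(sum(g) / len(g) for g in groups) / len(groups)


# Two micro-batches with different numbers of response tokens.
short_response = [2.0, 2.0]   # 2 tokens, high loss
long_response = [0.5] * 8     # 8 tokens, low loss
varied = [short_response, long_response]

varied_token_mean = token_mean(varied)        # (4.0 + 4.0) / 10 tokens
varied_mean_of_means = mean_of_means(varied)  # (2.0 + 0.5) / 2 groups

# Equal-length groups: the two aggregations coincide.
equal = [[2.0] * 4, [0.5] * 4]
equal_token_mean = token_mean(equal)
equal_mean_of_means = mean_of_means(equal)

print(varied_token_mean, varied_mean_of_means)
print(equal_token_mean, equal_mean_of_means)
```

In the varied-length case each token of the short response ends up counting four times as much as a token of the long one, so the reported loss and its gradient are skewed toward short responses. With equal lengths the two formulas agree, so nothing looks wrong.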

