May 17, 2026

How a 30B Open Model Reached Olympiad Gold With the Right Recipe

31 minutes

Source: Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling

Paper was published on May 13, 2026

This episode was AI-generated on May 16, 2026. The script was written by an AI language model and the host voices were synthesized by ElevenLabs. The producer is not affiliated with Anthropic or ElevenLabs.

A thirty-billion-parameter open-source model just matched the top human score on the 2026 USAMO, and the full training recipe is public. The result suggests that olympiad-grade proof reasoning, long assumed to require trillion-parameter frontier systems, may depend more on training procedure than on raw scale.

Key Takeaways

Why proof-writing and answer-finding are fundamentally different skills, and how models can score 95% on answer-based math but 20% on proof benchmarks

The reverse-perplexity curriculum: feeding the model its most surprising training examples first, and why it beats both random and easy-first ordering

How a two-stage RL progression — cheap verifiable rewards, then expensive proof-quality rewards — extracts more capability than either alone

The test-time scaling loop where the model writes ~100,000-token proofs, critiques them, and iterates up to 30 times per attempt

The honest asterisks: human-vs-model comparison conditions, grading regime differences, and the substantial inference compute the recipe requires

Where SU-01 is genuinely strong (formally tractable problems) versus where it still fails (global combinatorial structure and delicate invariants)
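
The reverse-perplexity ordering in the takeaways above can be illustrated with a minimal sketch. It assumes each training example already carries per-token natural-log probabilities from some scoring model. The names `perplexity`, `reverse_perplexity_order`, and the toy data are hypothetical, and these notes do not say which model the paper scores with or which tokens it scores.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood) over a sequence's tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def reverse_perplexity_order(examples):
    """Sort examples from most to least surprising (highest perplexity first)."""
    return sorted(
        examples,
        key=lambda ex: perplexity(ex["token_logprobs"]),
        reverse=True,
    )


# Hypothetical toy data, not taken from the paper.
examples = [
    {"id": "easy", "token_logprobs": [-0.1, -0.2, -0.1]},
    {"id": "hard", "token_logprobs": [-2.0, -1.5, -2.5]},
    {"id": "medium", "token_logprobs": [-0.8, -1.0, -0.9]},
]

curriculum = [ex["id"] for ex in reverse_perplexity_order(examples)]
print(curriculum)  # → ['hard', 'medium', 'easy']
```

An easy-first curriculum would be the same sort with `reverse=False`, and a random baseline would shuffle instead of sorting.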

00:00 — The headline result and why it's different
SU-01, an open 30B mixture-of-experts model with a fully documented recipe, matches the top human USAMO 2026 score.

03:27 — Answer-finding versus proof-writing
Why olympiad problems expose a hidden ceiling that answer-based benchmarks miss, and how the paper frames the core training challenge.

06:55 — Stage one: reverse-perplexity curriculum
The counterintuitive idea of training on the most surprising examples first, with ablations showing it nearly doubles performance over easy-first ordering.

10:23 — Stage two: coarse RL on verifiable answers
Why sequence-level GSPO works for MoE models where standard token-level GRPO breaks down, and how this stage halves the gap to frontier performance.

13:51 — Stage three: refined RL with a proof-grading judge
Moving from answer-correctness to proof-validity rewards, with self-refinement on failures and a low-entropy experience replay buffer for rare successes.

17:19 — Stage four: test-time scaling and 100,000-token proofs
The inference-time loop of drafting, self-critiquing, and iteratively repairing proofs that produces the gold-medal-level scores.

20:46 — The steelman: asterisks on the headline numbers
Honest accounting of grading regime differences, inference compute costs, and what the human-comparison framing does and doesn't support.

24:14 — Where the model is strong and where it fails
A capability map showing SU-01 excels at problems with rigid formal structure but struggles with global combinatorial arguments — including the elegant complex-number solution to USAMO Problem 3.

27:42 — Specializable generalists and what's actually new
The paper's broader claim that the training recipe, not just scale, is doing much of the work, and which methodological contributions transfer beyond olympiad math.
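
For listeners who want a concrete picture of the drafting, self-critiquing, and repair loop from the 17:19 chapter, here is a minimal sketch of its control flow. It assumes the loop stops early when the critique finds no issue, which these notes do not confirm. The `draft`, `critique`, and `repair` callables are hypothetical stand-ins for model calls, and only the cap of 30 iterations comes from the episode.

```python
def refine_proof(problem, draft, critique, repair, max_rounds=30):
    """Draft a proof, then alternate critique and repair for up to max_rounds.

    Returns (proof, repairs_made). A critique that returns None is treated
    as "no issue found" and ends the loop early (an assumption of this sketch).
    """
    proof = draft(problem)
    for repairs_made in range(max_rounds):
        issue = critique(problem, proof)
        if issue is None:
            return proof, repairs_made
        proof = repair(problem, proof, issue)
    # Budget exhausted: the last repaired proof is returned without a final critique.
    return proof, max_rounds


# Hypothetical toy stand-ins: a "proof" is a list of steps, and the critic
# reports the first required step that is still missing.
REQUIRED = ["base case", "inductive step", "conclusion"]


def toy_draft(problem):
    return ["base case"]


def toy_critique(problem, proof):
    for step in REQUIRED:
        if step not in proof:
            return step
    return None


def toy_repair(problem, proof, issue):
    return proof + [issue]


final_proof, repairs = refine_proof("toy problem", toy_draft, toy_critique, toy_repair)
print(final_proof, repairs)
```

In a real system each callable would be a long model generation, so the iteration cap directly bounds inference cost per attempt, which is the compute caveat raised in the 20:46 chapter.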