May 25, 2026

Same Model, Organized Differently: How an Agent Architecture Beat Frontier Systems at Research Math

22 minutes

Source: RMA: an Agentic System for Research-Level Mathematical Problems

Paper was published on May 20, 2026

This episode was AI-generated on May 25, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

A university research group just outperformed the flagship math systems from OpenAI and DeepMind on a benchmark of problems contributed by working mathematicians, using a base model that scores zero on the same benchmark when run on its own. The trick wasn't a bigger model. It was decomposing the work of a mathematician into specialized agents sharing a structured whiteboard, and the implications for AI progress reach well beyond math.

Key Takeaways

How RMA, built on Claude Opus 4.6, solves 8 of 10 First Proof problems while the same model with no scaffolding solves 0

The seven-agent setup — initializer, three proposers, three verifiers — and why an append-only shared memory is what actually makes the rounds compound

The six modules that encode a working mathematician's workflow, including a Proof Commandment checklist and a pre-committed literature search designed to prevent contamination

Ablation results showing that stripping any major component (memory, verifiers, or modules) collapses performance, and that adding more refinement rounds eventually makes proofs worse

Why the comparison to GPT-5.2R and Aletheia isn't apples-to-apples, and what the honest version of the claim actually is

The Spielman ε-light subset problem as a concrete case: GPT-5.2R hallucinates a citation and lands a weaker bound; RMA produces a clean proof with a tighter bound using a different known technique
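
For listeners who think in code, here is a minimal sketch of the shared-whiteboard idea from the takeaways above: one initializer, three proposers, and three verifiers taking turns over five rounds, each reading the full history and appending to it. This is an illustration, not the paper's implementation. The names `Entry`, `AppendOnlyMemory`, and `run_rounds`, the entry fields, and the ordering inside a round (all proposers, then all verifiers) are assumptions made for the sketch, and the agents are plain functions standing in for model calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Entry:
    """One immutable record on the whiteboard (field names are hypothetical)."""
    round_no: int
    author: str
    kind: str  # "plan", "proposal", or "review"
    text: str


class AppendOnlyMemory:
    """A log that can only grow: no method edits or deletes earlier entries."""

    def __init__(self) -> None:
        self._entries: List[Entry] = []

    def append(self, entry: Entry) -> None:
        self._entries.append(entry)

    def snapshot(self) -> Tuple[Entry, ...]:
        # Return a tuple so callers cannot mutate the underlying log.
        return tuple(self._entries)


# An agent is modeled as a function from the history so far to new text.
Agent = Callable[[Tuple[Entry, ...]], str]


def run_rounds(
    initializer: Agent,
    proposers: Sequence[Agent],
    verifiers: Sequence[Agent],
    rounds: int = 5,
) -> AppendOnlyMemory:
    memory = AppendOnlyMemory()
    # The initializer writes the first entry before any round starts.
    memory.append(Entry(0, "initializer", "plan", initializer(memory.snapshot())))
    for r in range(1, rounds + 1):
        # Assumed ordering: every proposer writes, then every verifier reviews.
        for i, propose in enumerate(proposers):
            text = propose(memory.snapshot())
            memory.append(Entry(r, f"proposer-{i}", "proposal", text))
        for i, verify in enumerate(verifiers):
            text = verify(memory.snapshot())
            memory.append(Entry(r, f"verifier-{i}", "review", text))
    return memory
```

Because every agent receives a snapshot of everything written before it, a proposer in a later round can build on earlier reviews instead of starting from scratch, which is the sense in which the rounds compound. Calling `run_rounds` with stub agents, for example `lambda history: f"saw {len(history)}"`, is enough to watch the log grow.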

00:00 — The headline result on the First Proof benchmark
RMA solves 8 of 10 expert-contributed problems while frontier systems and the base model alone score far lower.

02:47 — The seven-agent setup and the shared whiteboard
How initializer, proposer, and verifier agents iterate across five rounds through an append-only structured memory.

05:35 — The six modules that encode mathematical workflow
Problem analysis, knowledge bank, proof commandments, and the literature modules that turn a generic LLM into a math-research collaborator.

08:23 — Methodological discipline against contamination
Pre-committed search lists, sandboxed tools, and a training cutoff that predates the benchmark release.

11:10 — The ablation table and the architecture-versus-scale claim
Stripping modules, memory, or verifiers collapses win rates, and a same-compute best-of-N baseline reaches roughly half of RMA's performance.

13:58 — Where the claims shouldn't be pushed too far
Ten problems is a small sample, the industrial-system comparisons aren't controlled, and informal proofs resist bright-line evaluation.

16:46 — The Spielman problem as a concrete illustration
Three systems, three outcomes, and what the leverage-score proof reveals about applying known tools versus discovering new ones.

19:34 — What this means for AI progress beyond math
Why long-horizon reasoning tasks may benefit more from orchestration than from larger models, with appropriate caveats.