May 09, 2026

Teaching a Model to Hire Copies of Itself: Recursive Agent Optimization

22 minutes

Source: Recursive Agent Optimization

The paper was published on May 7, 2026.

This episode was AI-generated on May 8, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

A 30-billion-parameter open model keeps pace with Claude Sonnet 4 and OpenAI's o3 on a long-context benchmark — not by being bigger, but by learning to spawn copies of itself and delegate. A new paper argues recursion shouldn't be a scaffold wrapped around frozen models; it should be a primitive the weights are actually trained to use, and the results suggest a different axis for scaling agents than bigger models or longer context.

Key Takeaways

Why RAO's central move — putting recursive delegation inside the RL loop instead of around a frozen model — is the whole intellectual contribution

How rewarding average (not summed) child success teaches the model when delegation is worth it, not just how to do it

The phase-transition result on hard crafting tasks: 0% to 88% with the same 4B base model, generalizing past its training depth

How a 30B recursive agent matches Sonnet 4 and o3 on Oolong-Real despite a context window six times smaller than the inputs

Why the same trained agent fans out in parallel 85% of the time on independent sub-tasks but only 1.5% of the time on chained ones, where it serializes instead

The honest costs: RAO is up to 18x slower in wall-clock time on some tasks, models are trained per task family, and the strongest results come from benchmarks whose structure suits the method

00:00 — The setup: an agent that can spawn itself
How RAO adds one async Python function — spawn a child with a fresh context — and lets recursive trees emerge from ordinary control flow.
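
To make the "one async function" idea concrete, here is a minimal sketch of how a delegation tree can fall out of ordinary control flow. It is a toy, not the paper's code: `run_agent`, `spawn`, `MAX_DEPTH`, and the split-on-" and " rule are all hypothetical stand-ins, and no model is called.

```python
import asyncio

MAX_DEPTH = 3  # hypothetical safety cap, not taken from the paper


async def run_agent(task: str, depth: int = 0) -> str:
    """Toy stand-in for one agent episode; a real agent would call a model here."""

    async def spawn(subtask: str) -> str:
        # The child gets a fresh context: only the subtask string is passed down.
        return await run_agent(subtask, depth + 1)

    # Hypothetical decomposition rule: peel off the first " and " clause.
    head, sep, tail = task.partition(" and ")
    if not sep or depth >= MAX_DEPTH:
        return f"done({task})"

    # Plain control flow shapes the tree: both children run concurrently,
    # and the second child may itself spawn further children.
    left, right = await asyncio.gather(spawn(head), spawn(tail))
    return f"{left} + {right}"


result = asyncio.run(run_agent("book flights and find hotel and plan day trips"))
print(result)
```

The parent never sees its children's intermediate steps, only the strings they return, which is the sense in which each child has its own context.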

02:17 — The Kyoto travel example
A walk through Figure 1 of the paper, where the model dynamically grows a three-level delegation tree to plan a trip.

04:35 — Scaffold versus trained behavior
Why existing recursive systems like Claude Code and Codex wrap frozen models, and what changes when the weights themselves learn to delegate.

06:53 — Local rewards and the 'average child success' trick
How RAO scores each node with its own task success plus the mean (not sum) of its children's success, and why that distinction kills bad incentives.
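
The mean-versus-sum distinction is easy to see in a few lines. The sketch below assumes an unweighted sum of the node's own success and the mean success of its direct children. The `Node` class and both function names are illustrative, and the paper's exact weighting may differ.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Node:
    success: float  # this node's own task success, in [0, 1]
    children: list["Node"] = field(default_factory=list)


def local_reward(node: Node) -> float:
    """Own success plus the MEAN of direct children's success (assumed unweighted)."""
    if not node.children:
        return node.success
    return node.success + mean(c.success for c in node.children)


def summed_reward(node: Node) -> float:
    """Counter-example: a sum grows with the number of children spawned."""
    return node.success + sum(c.success for c in node.children)


few = Node(1.0, [Node(1.0), Node(1.0)])
many = Node(1.0, [Node(1.0) for _ in range(10)])

print(local_reward(few), local_reward(many))    # same reward either way
print(summed_reward(few), summed_reward(many))  # the sum favors over-spawning
```

Under the mean, spawning ten successful children pays no more than spawning two, so there is no incentive to delegate just to collect reward.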

09:11 — Baselines and variance reduction
The unusual choice to apply a single root-task leave-one-out baseline across every node in the tree, and the tradeoffs the authors flag.
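
One possible reading of that baseline, sketched under assumptions: several rollouts are sampled for the same root task, each rollout's baseline is the mean root reward of the other rollouts, and that single number is subtracted from the reward of every node in the rollout's tree. The function names and the flat list-of-lists layout are hypothetical, and the paper's actual estimator may differ in detail.

```python
def leave_one_out_baselines(root_rewards: list[float]) -> list[float]:
    """Baseline for rollout i = mean root reward of all the other rollouts."""
    n = len(root_rewards)
    if n < 2:
        raise ValueError("need at least two rollouts for a leave-one-out baseline")
    total = sum(root_rewards)
    return [(total - r) / (n - 1) for r in root_rewards]


def node_advantages(
    node_rewards: list[list[float]], root_rewards: list[float]
) -> list[list[float]]:
    """Subtract each rollout's single root-level baseline from all of its nodes."""
    baselines = leave_one_out_baselines(root_rewards)
    return [[r - b for r in nodes] for nodes, b in zip(node_rewards, baselines)]


# Three rollouts of one root task; inner lists hold per-node rewards in each tree.
root_rewards = [1.0, 0.0, 1.0]
node_rewards = [[1.0, 2.0], [0.0], [1.0, 1.0, 0.0]]
advantages = node_advantages(node_rewards, root_rewards)
print(advantages)
```

The appeal of this design is that it needs no matching of sub-tasks across rollouts, since trees from different rollouts can have different shapes. The cost is that a root-level baseline may be poorly calibrated for deep nodes.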

11:28 — TextCraft-Synth: phase transition on hard tasks
On the authors' own crafting benchmark, the 4B recursive agent jumps from 0% to 88% on hard tasks and learns to grow trees deeper than it was trained for.

13:46 — Oolong-Real: matching frontier models with a smaller window
A 30B recursive agent reaches roughly the same scores as Sonnet 4 and o3 on long D&D transcripts, including a moment where it briefly learns the wrong strategy and recovers.

16:04 — Deep Dive: when recursion can't parallelize
On a multi-hop research benchmark with sequentially dependent sub-tasks, the recursive agent gets more answers right but runs about 18x slower in wall-clock time.
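
A back-of-the-envelope model shows why dependency structure matters for latency. This is an idealized sketch that ignores model and scheduling overhead, and it does not attempt to reproduce the 18x figure. The function name and the example durations are made up for illustration.

```python
def wall_clock(durations: list[float], parallel: bool) -> float:
    """Idealized time for a parent to finish all of its children."""
    if not durations:
        return 0.0
    # Independent children overlap, so the slowest one sets the pace.
    # Chained children must run one after another, so their times add up.
    return max(durations) if parallel else sum(durations)


child_seconds = [2.0] * 6  # six hypothetical children, two seconds each

print(wall_clock(child_seconds, parallel=True))
print(wall_clock(child_seconds, parallel=False))
```

When each hop needs the previous hop's answer, the agent pays the full sum, plus the cost of starting every child, with no overlap to hide it.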

18:22 — The steelman critique
Where the benchmarks favor RAO's structure, how LLM-judge reward signals could confound the results, and what the compute-equivalent comparison would look like.

20:39 — What this says about scaling
Why RAO is a vote for training models to use inference-time scaffolds, and how it reframes test-time compute scaling as a tree of agents rather than one long thought.