Teaching a Model to Hire Copies of Itself: Recursive Agent Optimization
Source: Recursive Agent Optimization
Paper was published on May 07, 2026
This episode was AI-generated on May 8, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.
A 30-billion-parameter open model keeps pace with Claude Sonnet 4 and OpenAI's o3 on a long-context benchmark — not by being bigger, but by learning to spawn copies of itself and delegate. A new paper argues recursion shouldn't be a scaffold wrapped around frozen models; it should be a primitive the weights are actually trained to use, and the results suggest a different axis for scaling agents than bigger models or longer context.
Key Takeaways
Why RAO's central move — putting recursive delegation inside the RL loop instead of around a frozen model — is the whole intellectual contribution
How rewarding average (not summed) child success teaches the model when delegation is worth it, not just how to do it
The phase-transition result on hard crafting tasks: 0% to 88% with the same 4B base model, generalizing past its training depth
How a 30B recursive agent matches Sonnet 4 and o3 on Oolong-Real despite a context window six times smaller than the inputs
Why the same trained agent fans out 85% in parallel on independent sub-tasks but serializes to 1.5% on chained ones
The honest costs: RAO is up to 18x slower in wall clock on some tasks, models are trained per task family, and the strongest results come from benchmarks whose structure suits the method
00:00 — The setup: an agent that can spawn itself
How RAO adds one async Python function — spawn a child with a fresh context — and lets recursive trees emerge from ordinary control flow.
02:17 — The Kyoto travel example
A walk through Figure 1 of the paper, where the model dynamically grows a three-level delegation tree to plan a trip.
04:35 — Scaffold versus trained behavior
Why existing recursive systems like Claude Code and Codex wrap frozen models, and what changes when the weights themselves learn to delegate.
06:53 — Local rewards and the 'average child success' trick
How RAO scores each node with its own task success plus the mean (not sum) of its children's success, and why that distinction kills bad incentives.
09:11 — Baselines and variance reduction
The unusual choice to apply a single root-task leave-one-out baseline across every node in the tree, and the tradeoffs the authors flag.
11:28 — TextCraft-Synth: phase transition on hard tasks
On the authors' own crafting benchmark, the 4B recursive agent jumps from 0% to 88% on hard tasks and learns to grow trees deeper than it was trained for.
13:46 — Oolong-Real: matching frontier models with a smaller window
A 30B recursive agent reaches roughly the same scores as Sonnet 4 and o3 on long D&D transcripts, including a moment where it briefly learns the wrong strategy and recovers.
16:04 — Deep Dive: when recursion can't parallelize
On a multi-hop research benchmark with sequentially dependent sub-tasks, the recursive agent gets more answers right but runs about 18x slower in wall clock.
18:22 — The steelman critique
Where the benchmarks favor RAO's structure, how LLM-judge reward signals could confound the results, and what the compute-equivalent comparison would look like.
20:39 — What this says about scaling
Why RAO is a vote for training models to use inference-time scaffolds, and how it reframes test-time compute scaling as a tree of agents rather than one long thought.
Recommended Reading
ADaPT: As-Needed Decomposition and Planning with Language Models — An inference-time recursive decomposition system that the RAO paper positions itself against — useful for seeing what 'recursion as scaffold around a frozen model' looks like before training enters the picture.
Toolformer: Language Models Can Teach Themselves to Use Tools — The canonical example of the 'train models to use scaffolds, don't just prompt them' principle the episode highlights as RAO's intellectual lineage.
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models — The original prompting trick that later became trained reasoning behavior — the precedent Finn cites for why recursive delegation is plausibly the next scaffold-to-weights transition.
Tree of Thoughts: Deliberate Problem Solving with Large Language Models — An earlier vision of branching, tree-structured reasoning at inference time, useful as a contrast to RAO's training-time approach to tree-structured agent execution.