Why Long-Horizon AI Agents Get Stuck, and a Milestone-Based Fix That Helps
Source: A Subgoal-driven Framework for Improving Long-Horizon LLM Agents
Paper was published on March 20, 2026
This episode was AI-generated on May 2, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.
Half the time AI web agents fail, they're not wrong — they're lost, looping in circles after a single bad click. A new Google DeepMind paper argues the bottleneck isn't model intelligence but planning architecture, and shows that a single idea — milestones — can serve double duty as both runtime scaffolding and a denser training signal, lifting a 12B open model from 6% to 43% on a web navigation benchmark.
Key Takeaways
- Why nearly half of agent failures are 'getting stuck' rather than misunderstanding the task, and what that says about where the real bottleneck is
- How the same milestone idea solves two different problems: runtime confusion at inference time and the credit assignment problem during RL training
- How MiRA uses a 'potential critic' trained only on successful trajectories to give per-step shaping rewards, with a mathematical guarantee against corrupting the goal
- Why the headline 'open model beats GPT-4' result deserves an asterisk: the small student was trained against subgoals and progress labels generated by a frontier teacher
- A specific failure mode that gets *worse* after MiRA training (premature termination), and the unresolved question of whether the shaping reward is partly to blame
- Why structured thinking with milestones beat brute-force thinking budgets: bigger reasoning budgets actually hurt performance past a point

00:00 — Task 429 and the diagnostic microscope
A single mis-click on step three derails an agent, motivating an automated tool that classifies failure modes and pinpoints exactly where trajectories diverge.

02:42 — The dominant failure mode is getting stuck, not being wrong
Across multiple models, roughly half of failed trajectories are loops and wandering, which reframes the problem from intelligence to planning architecture.

05:24 — SGO: milestones as inference-time scaffolding
How subgoal generation plus an introspective AutoRater gives Gemini 2.5 Pro a structured sense of where it is in a task, worth about ten points on its own.

08:06 — MiRA and the potential critic
Training a second network to predict subgoal progress and using its temporal differences as a dense shaping reward, with a 1999 result from Andrew Ng guaranteeing the goal stays intact.

10:49 — How the progress labels get made
Linear interpolation between subgoal completions on successful trajectories, plus a 'gap anchoring' trick to keep signal alive during the verify-and-submit phase.

13:31 — The headline number and its asterisk
A 12B open model reaches 43% on WebArena-Lite, beating GPT-4 Turbo, but both the curriculum and the reward labels come from a frontier teacher model.

16:13 — Where the method breaks and what's unresolved
Cold-start exploration problems, rising premature-termination errors, no shaping-reward annealing, and benchmark curation choices that deserve scrutiny.

18:56 — The behavioral phase transition
A subgoal-completion heatmap across six training rounds shows the agent learning to chain milestones in order, which is visible evidence of learning to plan.

21:38 — What's portable beyond this paper
The diagnostic methodology, the unifying milestone frame, and a recipe for synthesizing dense supervision automatically, generalizing process reward models beyond math.

Recommended Reading
- Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping — Andrew Ng's 1999 paper establishing the potential-based reward shaping result that MiRA leans on to guarantee its progress bonus doesn't corrupt the optimal policy.
- Let's Verify Step by Step — The process reward model paper Brooks name-checks at the end, showing dense step-level supervision beats outcome-only feedback in math reasoning. This is the lineage MiRA extends by synthesizing the step labels automatically.
- WebArena: A Realistic Web Environment for Building Autonomous Agents — The benchmark environment behind WebArena-Lite, useful for understanding what 'Task 429' and the five domains actually look like and why the authors chose the curated subset.
- WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning — The previous open-model state of the art on WebArena-Lite (38.4%) that MiRA dethrones, and a useful contrast in how to structure curricula and rewards for long-horizon web agents.
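
For listeners who want the ideas from the 08:06 and 10:49 chapters in concrete form, here is a minimal Python sketch. It is not the paper's implementation. The function names `progress_labels` and `shaping_rewards` are hypothetical, the labeling rule (progress k/K at the step that completes subgoal k, starting from 0) is an assumption, the 'gap anchoring' trick is not modeled, and the interpolated labels stand in for the trained potential critic's predictions. The shaping term itself, gamma * phi(s') - phi(s), is the standard potential-based form from the 1999 reward shaping paper.

```python
from typing import List, Sequence


def progress_labels(num_steps: int, completion_steps: Sequence[int]) -> List[float]:
    """Linearly interpolated progress targets for one successful trajectory.

    completion_steps[k] is the 0-indexed step at which subgoal k+1 completes.
    Assumed rule (not taken from the paper): progress is 0.0 at step 0 and
    (k+1)/K at completion_steps[k], with straight lines in between.
    """
    total = len(completion_steps)
    xs = [0] + list(completion_steps)
    if total == 0 or xs[-1] >= num_steps or any(b <= a for a, b in zip(xs, xs[1:])):
        raise ValueError("completion steps must be increasing and within the trajectory")
    ys = [k / total for k in range(total + 1)]

    labels = [1.0] * num_steps  # steps after the last subgoal stay at 1.0
    for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:])):
        for t in range(x0, x1 + 1):
            labels[t] = y0 + (y1 - y0) * (t - x0) / (x1 - x0)
    return labels


def shaping_rewards(potentials: Sequence[float], gamma: float = 0.99) -> List[float]:
    """Potential-based shaping bonus for each transition s_t -> s_{t+1}."""
    return [gamma * nxt - cur for cur, nxt in zip(potentials, potentials[1:])]
```

Calling `progress_labels(7, [2, 6])` gives a progress curve that reaches 0.5 at step 2 and 1.0 at step 6, and `shaping_rewards` turns that curve into one bonus per step. The discounted sum of those bonuses telescopes to gamma^T * phi(s_T) - phi(s_0), which depends only on the potentials at the start and end of the trajectory and not on the steps in between. That is the intuition behind the guarantee that the progress bonus leaves the optimal policy unchanged.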