Teaching a Phone Agent to Reason Silently, And Keeping It Honest
Source: MIRAGE: Mobile Agents with Implicit Reasoning and Generative World Models
Paper was published on June 03, 2026
This episode was AI-generated on June 4, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.
Good mobile AI agents write a paragraph of reasoning before every tap, which makes them smart but painfully slow. This episode unpacks MIRAGE, which moves that reasoning into silent hidden vectors, parallelizes it with a century-old numerical trick, and forces it to stay sharp by predicting the next screen, matching the quality of written reasoning at roughly a fifth of the cost.
Key Takeaways
- Why stripping reasoning out of an agent doesn't just remove a bonus but actively drops it below the untouched base model (42.9 to 31)
- How APLR borrows Jacobi iteration to parallelize sequential latent reasoning with a provable guarantee that the first K thought-slots are exact
- The trick that keeps invisible reasoning honest: a throwaway 'world model' head that forces the silent slots to predict the next screen's features during training only
- How the ablation table tells the whole thesis in five numbers, with the world model recovering the chain-of-thought score (52.6) to the decimal
- Where the headline 'matches chain-of-thought' claim is fragile: it rests on a tie at a single benchmark number, and the slot-specialization story is shown correlationally, not proven
- Why the latent scratchpad isn't free: dropping from nine slots to three craters success from 52.6 to 32.8
00:00 — The cost of agents that narrate every tap
Why step-by-step reasoning helps mobile agents but makes each action slow and verbose, and what MIRAGE claims to fix.
03:01 — Reasoning without words
How a model can think in continuous hidden vectors instead of generating text, building on the earlier Coconut approach.
06:02 — APLR and the Jacobi iteration trick
Using the one-way dependency structure of causal attention to parallelize latent reasoning with a provable correctness guarantee.
09:03 — The world model that keeps silent reasoning honest
A lightweight head that forces the under-supervised thought-slots to predict next-screen features during training, then gets discarded at inference.
12:04 — Two-stage training and why ordering matters
First teaching the shape of good reasoning out loud, then migrating it into silent latent slots.
15:05 — The ablation table, five numbers that carry the argument
Walking through the AndroidWorld results from removing reasoning entirely up to full MIRAGE recovering the chain-of-thought score.
18:06 — Where the claims are fragile
Steelman critiques on the single-number tie, the correlational slot-specialization story, and what 'world model' really means here.
21:07 — What travels beyond phones
The reframe of where reasoning should live and why the parallelization trick should generalize to other causal computations.
Recommended Reading
- Training Large Language Models to Reason in a Continuous Latent Space — The 'Coconut' paper named in the episode as MIRAGE's direct ancestor: the work that first taught models to reason in continuous vectors instead of words, and whose serial-slot bottleneck APLR was designed to fix.
- AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents — The live on-device benchmark of 116 task instances across 20 apps that anchors every headline number in this episode's ablation table.
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models — The foundational case for the visible 'show your work' reasoning that MIRAGE tries to match while making it silent, and the explicit chain-of-thought baseline the whole paper measures itself against.
- AndroidControl: A Dataset for Mobile Device Control — The static, ground-truth-action benchmark behind the episode's 'cleanest single line': 75% to 91% low-level action accuracy at one-sixth the tokens.
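
For listeners who want to see the Jacobi idea from the 06:02 chapter in miniature, here is a toy Python sketch. It is not the paper's APLR: the `step` function, the nine-slot count, and the zero initialization are all assumptions made up for illustration. The only thing it tries to show is the general property the episode describes: when each slot depends only on earlier slots, recomputing every slot in parallel from the previous guess makes the first K slots exact after K sweeps.

```python
import math


def step(prefix):
    """Hypothetical causal update: a slot's value depends only on earlier slots."""
    return math.tanh(0.5 + 0.9 * sum(prefix))


def sequential(n_slots):
    """Reference answer: compute slots one at a time, each waiting on the last."""
    slots = []
    for _ in range(n_slots):
        slots.append(step(slots))
    return slots


def jacobi_sweep(slots):
    """One Jacobi sweep: every slot is recomputed from the PREVIOUS iterate only.

    The per-slot computations do not depend on each other within a sweep,
    so on real hardware they could all run in parallel.
    """
    return [step(slots[:i]) for i in range(len(slots))]


def jacobi(n_slots, n_sweeps, init=0.0):
    """Start from a flat guess and apply n_sweeps parallel sweeps."""
    slots = [init] * n_slots
    for _ in range(n_sweeps):
        slots = jacobi_sweep(slots)
    return slots


def exact_prefix_length(approx, exact):
    """Count how many leading slots already match the sequential answer."""
    count = 0
    for a, e in zip(approx, exact):
        if a != e:
            break
        count += 1
    return count
```

Calling `exact_prefix_length(jacobi(9, k), sequential(9))` for increasing `k` shows the exact prefix growing by at least one slot per sweep. The reason is a short induction: slot 0 depends on no other slot, so one sweep fixes it, and once slots 0 through k-1 are correct the next sweep computes slot k from correct inputs.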