Why Frontier Agents Ask for Clarification at Exactly the Wrong Moment
Source: Ask Early, Ask Late, Ask Right: When Does Clarification Timing Matter for Long-Horizon Agents?
Paper was published on May 08, 2026
This episode was AI-generated on May 11, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.
A clarifying question worth nothing at action thirty can be worth almost everything at action three — and a new paper draws the empirical curves to prove it. The kicker: none of the frontier models tested ask at the right time. GPT over-asks late, Gemini never asks at all, and the model that succeeds most is the one that asks least.
Key Takeaways
Why clarification value isn't a single threshold but four different decay curves, one each for goal, input, constraint, and context ambiguities
The forced-injection experimental design that isolates timing from noticing, and why disabling the ask tool was the key methodological move
The cliff for goal information: catching it at 10% of the trajectory recovers nearly oracle-level performance; catching it at 70% is worthless
Why late constraint clarifications can be actively destructive (worse than never asking), and the budget-report rounding example that shows it
How GPT-5.2, Claude Sonnet 4.5, and Gemini 3 Flash each miss the optimal window in completely different ways, and why Claude's 'ask less' strategy outperforms
Where the paper's statistical claims are thinner than the prose suggests, and the salience confound that may inflate the value-of-information curves
00:00 - The fiscal-vs-calendar quarter problem
Why clarification timing, not just whether to ask, is the question nobody had measured, and the opening example that motivates the paper.
03:15 - Forced injection: testing the fire department, not the smoke detector
How the authors isolated timing by disabling the ask tool and injecting synthetic clarifications at calibrated percentages of an oracle-derived action budget.
06:31 - Four kinds of missing information
Goal, input, constraint, and context, and the prediction that each should commit at a different rate and therefore decay differently.
09:46 - The empirical curves: a cliff, a slope, and a danger zone
What 84 tasks, four models, and 6,000+ trials revealed about how the value of clarification decays over a trajectory.
13:02 - The natural-ask study: nobody hits the window
What happens when the ask tool is turned back on, and the striking finding that the model asking least succeeds most.
16:18 - Clarification as a typed, time-sensitive resource
Why the single-confidence-threshold framing has the wrong shape, and what a typed gate would look like instead.
19:33 - Where the paper's claims are thinner than they sound
Salience confounds, benchmark floor effects, tiny sample sizes for context, and the limits of what forced injection can actually tell us.
22:49 - Porting old ideas into the agent era
How this work retrofits decades-old findings from decision theory and HCI interruption research onto long-horizon LLM agents.
26:04 - What a builder does with this
Practical implications for product teams, the domain-specific calibration work that doesn't yet exist, and where the research direction points next.
Recommended Reading
SWE-bench: Can Language Models Resolve Real-World GitHub Issues? - The benchmark family behind SWE-Bench Pro, one of the three testbeds used to draw the clarification timing curves discussed in the episode.
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks - The enterprise-workflow benchmark where the episode notes floor effects muddied the timing signal; useful context for why some of the paper's curves are cleaner than others.