Why AI Coding Agents Keep Trying to Debug Without a Debugger
Source: Dynamic analysis enhances issue resolution
Paper was published on March 23, 2026
This episode was AI-generated on May 2, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.
Today's AI coding agents try to fix bugs by reading code, never by watching it run. A new paper argues that's the wrong half of what human engineers actually do, and shows that giving agents real execution traces produces fixes that are not just more accurate but also systemic rather than band-aid patches. The quiet corroboration: agents that can see what code does end up reading less of it.
Key Takeaways
Why the bottleneck for AI coding agents may be perception, not reasoning: they're being asked to deduce runtime behavior from static text
How DAIRA's 'trigger-and-collect' tracer plus an indented-tree reformatter beat dumping raw traces into the model, an ablation that's the gem of the paper
The SymPy case study where dynamic visibility led the agent to a systemic fix instead of a defensive patch on the symptom
The token paradox: adding trace context cuts total input tokens by about 25% because the agent stops fishing through files
Why the headline 79.4% on SWE-bench Verified is partly a backbone-choice story, and what the cleaner controlled comparison actually shows
Where the dynamic-analysis story gets harder: bugs without clean reproductions, and small denominators on the hardest task tier

00:00 - The missing half of debugging
Why human engineers reach for a debugger first, and why current coding agents skip that step entirely.

02:35 - The Matplotlib case: symptom far from cause
A small motivating bug where a static-reading agent flails through unrelated files while a trace-equipped agent walks straight to the faulty classifier.

05:11 - The SymPy case: defensive fix vs. systemic fix
A polymorphic-dispatch nightmare where dynamic analysis lets the agent fix the cause instead of band-aiding the symptom.

08:35 - How DAIRA actually works
The three components (tracer, reformatter, workflow) and why the design keeps cognitive load on the agent low.

10:22 - The killer ablation: raw traces don't help
Feeding the firehose to the model performs at baseline; the indented-tree reformatting is doing nearly all the work.

12:58 - The token paradox and three model personalities
Why better information cuts total context use, and how Qwen, Gemini, and DeepSeek each spend the savings differently.

15:34 - What the critique looks like
Backbone mismatches in the headline number, benchmark generosity, an LLM in the reformatter loop, and small denominators on hard tasks.

18:09 - The durable lesson
Sometimes the right move isn't smarter reasoning machinery; it's giving the model a window into what the system is actually doing.

Recommended Reading
SWE-bench: Can Language Models Resolve Real-World GitHub Issues? - The benchmark the episode's headline 79% number is measured on; essential context for understanding what 'resolving an issue' actually means here.
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering - The foundational static-reading agent that DAIRA's controlled head-to-head comparison is built on top of, and the system whose limitations motivate adding runtime observability.
Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks - Empirical evidence for the episode's core claim that LLMs struggle to mentally simulate code execution, motivating why externalizing runtime behavior into traces helps.
Debug-gym: A Text-Based Environment for Interactive Debugging - A complementary line of work giving LLM agents access to actual debugger primitives like breakpoints, a useful contrast to DAIRA's lighter trigger-and-collect tracing approach.
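
Illustrative sketch: raw trace events vs. an indented call tree
For listeners who want a feel for the tracer-and-reformatter idea discussed at 08:35 and 10:22, here is a minimal Python sketch. It is not DAIRA's implementation, and nothing in it comes from the paper's code. The function names `collect_trace`, `format_tree`, `normalize`, and `classify` are all hypothetical, and the toy bug is invented for illustration. The only real API used is the standard library's `sys.settrace`. The sketch records call and return events while a reproduction runs, then renders them as an indented tree, which is roughly the kind of compact view the episode contrasts with dumping raw traces into the model.

```python
import sys

def collect_trace(func, *args, **kwargs):
    """Run func under sys.settrace and record raw call/return events."""
    events = []  # tuples of (kind, depth, function name, payload)
    depth = 0

    def tracer(frame, event, arg):
        nonlocal depth
        if event == "call":
            # At call time, f_locals holds just the arguments.
            events.append(("call", depth, frame.f_code.co_name, dict(frame.f_locals)))
            depth += 1
        elif event == "return":
            depth -= 1
            events.append(("return", depth, frame.f_code.co_name, arg))
        return tracer  # keep tracing inside this frame

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.settrace(previous)  # always restore the earlier tracer
    return result, events

def format_tree(events):
    """Render the flat event list as an indented call tree."""
    lines = []
    for kind, depth, name, payload in events:
        if kind == "call":
            args = ", ".join(f"{k}={v!r}" for k, v in payload.items())
            lines.append("  " * depth + f"{name}({args})")
        else:
            # Show the return value one level below the call it belongs to.
            lines.append("  " * (depth + 1) + f"-> {payload!r}")
    return "\n".join(lines)

# Hypothetical toy bug: the sign is lost in a helper, far from the symptom.
def normalize(value):
    return abs(value)

def classify(value):
    return "positive" if normalize(value) > 0 else "zero"

result, events = collect_trace(classify, -3)
print(format_tree(events))
# classify(value=-3)
#   normalize(value=-3)
#     -> 3
#   -> 'positive'
```

In this toy case the tree shows at a glance that `normalize` turned -3 into 3 before `classify` ever made its decision. A production tracer would also need filtering, truncation of large values, and handling for exceptions and generators, none of which is attempted here.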