This episode of AI Post Transformers examines "Agentic Code Reasoning" by Shubham Ugare and Satish Chandra from Meta, which introduces semi-formal reasoning certificates as an inference-time scaffold for LLM agents analyzing code without executing it. Rather than letting a model produce free-form chain-of-thought verdicts, the certificate framework requires the agent to state explicit premises, trace execution paths through real repository code, and produce a structured, auditable reasoning record for every claim it makes about code behavior. The Django bug django-13670 — involving two-digit year formatting for years before 1000 CE — anchors the discussion: two candidate patches claim to fix the issue, but unstated assumptions about name resolution lead an unstructured model to misidentify which one is correct. The certificate format forces the agent to chase the actual import chain across modules rather than guess from a function's name, turning premise verification into a natural driver of interprocedural analysis. Hosts Hal Turing and Dr. Ada Shannon situate the paper on the spectrum from fully formal proof assistants like Lean and Coq — which give machine-checked guarantees but are impractical for arbitrary repository code — down to unstructured LLM judges like CodeJudge and SWE-RM, which let the model skip edge cases and produce confident wrong answers. The certificate sits between those extremes, imposing enough structure to make implicit assumptions visible without requiring fully formalized language semantics.
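To make the structure concrete, here is a minimal sketch of what such a certificate might look like as data. The field names (Premise, ReasoningCertificate, evidence, trace) and the django-13670 details are illustrative assumptions, not the paper's actual schema.

```python
# Minimal sketch of a reasoning certificate as structured data.
# Field names and example content are illustrative assumptions, not the paper's schema.
from dataclasses import dataclass, field


@dataclass
class Premise:
    claim: str      # an explicit assumption the verdict depends on
    evidence: str   # file, line, or import chain the agent actually read
    verified: bool  # True only if the agent traced the evidence, not just asserted it


@dataclass
class ReasoningCertificate:
    question: str
    premises: list[Premise] = field(default_factory=list)
    trace: list[str] = field(default_factory=list)  # execution path read from source, not run
    verdict: str = ""


cert = ReasoningCertificate(
    question="Does this patch make the 'y' format character zero-pad years below 1000?",
    premises=[
        Premise(
            claim="'y' is handled by DateFormat.y in django/utils/dateformat.py",
            evidence="resolved by following the format-character dispatch, not the name alone",
            verified=True,
        )
    ],
    trace=["dateformat.format -> DateFormat.y -> year formatting expression"],
    verdict="Only the patch that edits the actual dispatch target changes the output.",
)
```

Each premise carries its own evidence field, which is what makes the record auditable: a reviewer can reject the verdict by rejecting a single unverified premise.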
The episode traces how the agentic setup amplifies the value of the certificate structure. Using a minimal SWE-agent configuration with bash tool access but no code execution, the agent can navigate the file system, run grep queries, and follow import chains — exploration scope without runtime confirmation. That constraint is precisely where interprocedural tracing becomes load-bearing: the agent cannot run the code to confirm a hypothesis, so it must read the actual call chain to know what a function does rather than infer from its name. The certificate makes that tracing explicit and auditable, which opens a secondary use case beyond RL reward signal generation: automated code review where a human auditor can inspect the agent's reasoning chain rather than accept a black-box verdict. Hal and Ada discuss RL training pipelines as the paper's stated primary motivation — execution-free reward signals could meaningfully reduce the cost of running sandboxed test suites at scale — but are careful to position that as a downstream consequence of the certificate's properties rather than its defining contribution.
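As a rough illustration of how premise verification forces that kind of tracing, the sketch below assumes nothing about the paper's tooling: it resolves a symbol by scanning source files for real definition sites versus places where the name is merely imported, so a certificate can cite concrete evidence instead of a name-based guess.

```python
# Sketch of execution-free symbol resolution (not the paper's tooling): find where a
# symbol is actually defined versus merely imported, so a premise can cite the concrete
# definition site rather than a similarly named helper.
import re
from pathlib import Path


def locate_symbol(repo: Path, symbol: str) -> dict[str, list[str]]:
    """Scan a repository for definitions and imports of `symbol` without running code."""
    hits: dict[str, list[str]] = {"definitions": [], "imports": []}
    def_pat = re.compile(rf"^\s*(?:def|class)\s+{re.escape(symbol)}\b")
    imp_pat = re.compile(rf"^\s*(?:from\s+\S+\s+)?import\s+.*\b{re.escape(symbol)}\b")
    for path in repo.rglob("*.py"):
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), start=1):
            if def_pat.match(line):
                hits["definitions"].append(f"{path}:{lineno}: {line.strip()}")
            elif imp_pat.match(line):
                hits["imports"].append(f"{path}:{lineno}: {line.strip()}")
    return hits


# Hypothetical usage over a Django checkout: the agent cites the definition hit in its
# certificate and follows the import hits when the definition lives in another module.
# locate_symbol(Path("django/utils"), "DateFormat")
```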
The episode closes on three open problems the paper leaves unresolved. First, the inference cost gap: the certificate framework adds computation at inference time, but the paper reports no latency measurements, no tokens-per-certificate data, and no comparison against unstructured baselines on cost — making it impossible to assess whether the accuracy gains justify the overhead in production. Second, certificate reuse as a concrete future direction: common interprocedural patterns across a codebase — frequently called utilities, stable library interfaces — could in principle be cached and reused across multiple verification queries, amortizing the inference cost that the paper never measures. Third, verification independence: the paper's circular verification problem remains open, since the same model that generates a certificate is also the model best positioned to judge whether the premises in that certificate are sound. Separating generation from verification — whether through a distinct model, a symbolic checker, or a human auditor — is the structural fix the framework points toward but does not yet provide.
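To illustrate the reuse idea, a speculative sketch (not something the paper implements or measures) could key previously verified evidence by the symbol and a content hash of the file it cites, invalidating the entry whenever that file changes:

```python
# Speculative sketch of certificate reuse, not implemented or measured in the paper:
# cache verified evidence keyed by (symbol, hash of the cited file), so repeated queries
# over an unchanged codebase skip re-tracing stable interprocedural facts.
import hashlib
from pathlib import Path
from typing import Callable

_premise_cache: dict[tuple[str, str], str] = {}


def _fingerprint(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def cached_evidence(symbol: str, cited_file: Path,
                    trace: Callable[[str, Path], str]) -> str:
    """Reuse prior evidence for `symbol` while `cited_file` is byte-identical."""
    key = (symbol, _fingerprint(cited_file))
    if key not in _premise_cache:
        _premise_cache[key] = trace(symbol, cited_file)  # the expensive agent re-tracing
    return _premise_cache[key]
```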
Sources:
1. Agentic Code Reasoning — Shubham Ugare, Satish Chandra, 2026
http://arxiv.org/abs/2603.01896
2. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? — Jimenez et al., 2024
https://scholar.google.com/scholar?q=SWE-bench:+Can+Language+Models+Resolve+Real-World+GitHub+Issues?
3. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models — Wei et al., 2022
https://scholar.google.com/scholar?q=Chain-of-Thought+Prompting+Elicits+Reasoning+in+Large+Language+Models
4. Program Equivalence — Godlin and Strichman, 2008
https://scholar.google.com/scholar?q=Program+Equivalence
5. LLM-based agents for automated software engineering: A survey — Multiple authors, 2024-2025
https://scholar.google.com/scholar?q=LLM-based+agents+for+automated+software+engineering:+A+survey
6. CodeJudge: Evaluating Code Generation with Large Language Models — Tong and Zhang, 2024
https://scholar.google.com/scholar?q=CodeJudge:+Evaluating+Code+Generation+with+Large+Language+Models
7. On designing effective RL reward at training time for LLM reasoning — approximate, 2024-2025
https://scholar.google.com/scholar?q=On+designing+effective+RL+reward+at+training+time+for+LLM+reasoning
8. Large language model critics for execution-free evaluation of code changes — approximate, 2024-2025
https://scholar.google.com/scholar?q=Large+language+model+critics+for+execution-free+evaluation+of+code+changes
9. AgentFL: Scaling LLM-based fault localization to project-level context — approximate, 2024-2025
https://scholar.google.com/scholar?q=AgentFL:+Scaling+LLM-based+fault+localization+to+project-level+context
10. SoapFL: A Standard Operating Procedure for LLM-based Method-Level Fault Localization — approximate, 2024-2025
https://scholar.google.com/scholar?q=SoapFL:+A+Standard+Operating+Procedure+for+LLM-based+Method-Level+Fault+Localization
11. Structured chain-of-thought prompting for code generation — approximate, 2022-2024
https://scholar.google.com/scholar?q=Structured+chain-of-thought+prompting+for+code+generation
12. Deductive verification of chain-of-thought reasoning — approximate, 2023-2024
https://scholar.google.com/scholar?q=Deductive+verification+of+chain-of-thought+reasoning
13. AI Post Transformers: Reasoning About Code Without Running It — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-08-reasoning-about-code-without-running-it-a9d01a.mp3
14. AI Post Transformers: Gradient Descent at Inference Time for LLM Reasoning — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-10-gradient-descent-at-inference-time-for-l-20617d.mp3
Interactive Visualization: LLM Agents Reason About Code Without Running It