


Meta says AI writes 80% of new code. Their own reviewers can't keep up with their own AI.
Straight from their engineering blog.
They built RADAR to auto-review low-risk diffs because "the share of diffs receiving timely review has declined." Their words. AI-generated code outpaced human review capacity.
Read that with the rest of the week's news.
Cognition says Devin merged 7x more PRs year-over-year. AI-written commits inside customer codebases jumped from 16% to 80%. Anthropic shipped Opus 4.8 on Wednesday, and every IDE, gateway, and agent runner had it the same day. Anthropic also disclosed a $47B revenue run-rate. The "is this a real business" debate is over.
But here is what keeps coming back to me:
Shipping more code faster is only a win if the systems that catch problems scale at the same rate. This week, the evidence says they don't.
A new arXiv study of 20,574 real coding-agent sessions documents how often agents do something other than what was asked. ITBench-AA, the first serious benchmark for agentic IT work, scored every frontier model below 50%.
Adoption is real. The guardrails are not.
This week's episode of The Human in the Loop covers all of it: the shipping wave, the cost-control backlash starting inside eng departments, and why ITBench-AA matters more than the score suggests.
By Enrique Cordero