METR: Half of SWE-Bench Passes Fail Real Code Review
METR found that maintainers would reject roughly half of the AI-generated PRs that pass SWE-bench's automated grading, with a 24-point gap that suggests benchmark scores substantially overstate production readiness.