May 07, 2026

An AI Agent That Found 28 Zero-Days in Windows — And What Made It Work

21 minutes

Source: Agentic Vulnerability Reasoning on Windows COM Binaries

Paper was published on May 06, 2026

This episode was AI-generated on May 7, 2026. The script was written by an AI language model and the host voices were synthesized by ElevenLabs. The producer is not affiliated with Anthropic or ElevenLabs.

Microsoft just paid $140,000 in bug bounties to an autonomous agent that found 28 previously unknown vulnerabilities in shipping Windows services and wrote working exploits for them. The same frontier models verified zero exploits with their default scaffolding and 26 with the right one, which makes this as much a story about tool design as about security.

Key Takeaways

How slyp's two-stage 'scout then sapper' architecture goes from decompiled binary to a working proof-of-concept exploit against live Windows services

Why three purpose-built tool servers — binary explorer, COM inspector, live debugger — turn out to matter more than raw model capability

The headline result: 27 of 40 benchmark cases solved with full tooling, versus 0 of 40 for production coding agents on default settings

Real-world deployment numbers: 28 confirmed zero-days, 16 CVEs, three of them low-integrity-to-SYSTEM escalations

Why static analyzers cap out around 0.30 F1 on this bug class while semantic reasoning over decompiled code reaches 0.97

Honest limitations: benchmark circularity on most cases, a cost of 7–11 million tokens per case, and the fact that a 'verified crash' is not yet weaponized RCE

00:00 — The bug class: races in Windows COM services
A walkthrough of the SetPrintTicket example showing how unlocked shared-pointer access in a multi-threaded service produces use-after-free and double-free primitives.

02:43 — Why traditional tools struggle here
Why fuzzers can't reliably hit race windows, why pattern-based static analyzers like COMRace miss bugs, and why manual reverse engineering doesn't scale.

05:27 — slyp's architecture: three tool servers behind the model
How the binary explorer, COM inspector, and dynamic debugger embed the mechanical work so the model spends tokens on semantic reasoning.

08:11 — Scout then sapper: the two-stage pipeline
How stage one produces a structured vulnerability report from binary exploration and stage two iterates compile-debug cycles to land a working exploit.

10:55 — Benchmark results and the scaffolding lesson
slyp hits 0.97 F1 on discovery and solves 27 of 40 exploit cases, while default coding agents on the same models verify zero — and the gap widens further on weaker models.

13:38 — Real-world deployment against Microsoft Windows
28 confirmed vulnerabilities, 16 CVEs, $140,000 in bounties across nine services, including three direct low-integrity-to-SYSTEM escalations.

16:22 — Steelman critiques
Benchmark circularity, the in-house static analyzer comparison, the gap between verified crash and weaponized exploit, and the per-case token cost.

19:06 — What generalizes beyond security
Why closed-source binary analysis is now in reach for agents, what the offense-defense math implies, and what the scaffolding result means for anyone building agents.
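
The bug class from the first chapter (two callers replacing a shared pointer with no lock held) can be illustrated with a toy model. This is only a sketch and not code from the paper or from Windows: the Ticket and Service classes, the three-step setter, and the helper functions are all hypothetical stand-ins, and the thread interleaving is simulated deterministically with generators rather than real threads.

```python
class Ticket:
    """Hypothetical stand-in for a heap object; records whether it was freed."""

    def __init__(self, name):
        self.name = name
        self.freed = False

    def release(self):
        # A real allocator would corrupt memory here; the model just reports it.
        if self.freed:
            raise RuntimeError(f"double free of {self.name}")
        self.freed = True


class Service:
    """Hypothetical multi-threaded service holding one shared ticket pointer."""

    def __init__(self):
        self.ticket = Ticket("initial")

    def set_ticket_steps(self, new_ticket):
        """Unlocked setter. Each yield marks a point where the scheduler
        could switch to another caller."""
        old = self.ticket          # step 1: read the shared pointer
        yield
        old.release()              # step 2: free the object that was read
        yield
        self.ticket = new_ticket   # step 3: publish the replacement


def run_interleaved(service):
    """Both callers read the old pointer before either one frees it."""
    a = service.set_ticket_steps(Ticket("A"))
    b = service.set_ticket_steps(Ticket("B"))
    next(a)  # caller A reads the old pointer
    next(b)  # caller B reads the same old pointer
    next(a)  # caller A frees it
    try:
        next(b)  # caller B frees it a second time
    except RuntimeError as err:
        return str(err)
    return "no fault"


def run_serialized(service):
    """Each setter runs to completion, as if a lock covered all three steps."""
    for name in ("A", "B"):
        for _ in service.set_ticket_steps(Ticket(name)):
            pass
    return "no fault"


if __name__ == "__main__":
    print(run_interleaved(Service()))
    print(run_serialized(Service()))
```

In this model the interleaved schedule reports a double free, while the serialized schedule does not. The fault depends entirely on the order in which the steps are scheduled, which is consistent with the episode's point that fuzzers cannot reliably hit race windows.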
