AI Papers: A Deep Dive

An AI Agent That Found 28 Zero-Days in Windows — And What Made It Work



Source: Agentic Vulnerability Reasoning on Windows COM Binaries

The paper was published on May 6, 2026.

This episode was AI-generated on May 7, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

Microsoft just paid $140,000 in bug bounties to an autonomous agent that found 28 previously unknown vulnerabilities in shipping Windows services and wrote working exploits for them. The same frontier models verified zero exploits with their default scaffolding and 26 with the right one — making this as much a story about tool design as about security.

Key Takeaways
  • How slyp's two-stage 'scout then sapper' architecture goes from decompiled binary to a working proof-of-concept exploit against live Windows services
  • Why three purpose-built tool servers — binary explorer, COM inspector, live debugger — turn out to matter more than raw model capability
  • The headline result: 27 of 40 benchmark cases solved with full tooling, versus 0 of 40 for production coding agents on default settings
  • Real-world deployment numbers: 28 confirmed zero-days, 16 CVEs, three of them low-integrity-to-SYSTEM escalations
  • Why static analyzers cap out around 0.30 F1 on this bug class while semantic reasoning over decompiled code reaches 0.97
  • Honest limitations: benchmark circularity on most cases, 7–11 million tokens per case, and 'verified crash' is not yet weaponized RCE
    • 00:00 — The bug class: races in Windows COM services
      A walkthrough of the SetPrintTicket example showing how unlocked shared-pointer access in a multi-threaded service produces use-after-free and double-free primitives.
    • 02:43 — Why traditional tools struggle here
      Why fuzzers can't reliably hit race windows, why pattern-based static analyzers like COMRace miss bugs, and why manual reverse engineering doesn't scale.
    • 05:27 — slyp's architecture: three tool servers behind the model
      How the binary explorer, COM inspector, and dynamic debugger embed the mechanical work so the model spends tokens on semantic reasoning.
    • 08:11 — Scout then sapper: the two-stage pipeline
      How stage one produces a structured vulnerability report from binary exploration and stage two iterates compile-debug cycles to land a working exploit.
    • 10:55 — Benchmark results and the scaffolding lesson
      slyp hits 0.97 F1 on discovery and solves 27 of 40 exploit cases, while default coding agents on the same models verify zero — and the gap widens further on weaker models.
    • 13:38 — Real-world deployment against Microsoft Windows
      28 confirmed vulnerabilities, 16 CVEs, $140,000 in bounties across nine services, including three direct low-integrity-to-SYSTEM escalations.
    • 16:22 — Steelman critiques
      Benchmark circularity, the in-house static analyzer comparison, the gap between verified crash and weaponized exploit, and the per-case token cost.
    • 19:06 — What generalizes beyond security
      Why closed-source binary analysis is now in reach for agents, what the offense-defense math implies, and what the scaffolding result means for anyone building agents.
    • Recommended Reading
      • SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering — Makes the same scaffolding-matters argument the episode highlights — that the interface between an LLM and its tools, not the model alone, determines agent capability.
      • Project Naptime: Evaluating Offensive Security Capabilities of Large Language Models — Google Project Zero's framework for LLM-driven vulnerability research, a direct point of comparison for slyp's binary-explorer-plus-debugger architecture in the offensive security agent space.
      • Teams of LLM Agents can Exploit Zero-Day Vulnerabilities — Earlier evidence for the offense-defense asymmetry the episode raises, focused on web vulnerabilities rather than closed-source Windows binaries.
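The bug class from the first chapter, unlocked access to a shared pointer in a multi-threaded service, can be illustrated with a small deterministic simulation. This is a toy sketch, not the real SetPrintTicket code and not anything from the paper: the names ToyHeap, setter_steps, getter_steps, and run are hypothetical, the "heap" only tracks which handles are live, and a lock is modeled crudely by forcing each thread to run to completion before the next starts.

```python
class ToyHeap:
    """Hypothetical heap that records misuse instead of corrupting memory."""

    def __init__(self):
        self.live = set()
        self.next_id = 1
        self.events = []

    def alloc(self):
        handle = self.next_id
        self.next_id += 1          # handles are never reused in this toy model
        self.live.add(handle)
        return handle

    def free(self, handle):
        if handle not in self.live:
            self.events.append("double-free")
            return
        self.live.remove(handle)

    def read(self, handle):
        if handle not in self.live:
            self.events.append("use-after-free")


def setter_steps(heap, service):
    """Assumed setter shape: read the shared pointer, free it, publish a new one."""
    old = service["ticket"]            # step 1: read shared pointer, no lock
    yield
    heap.free(old)                     # step 2: free whatever was read earlier
    yield
    service["ticket"] = heap.alloc()   # step 3: install the replacement


def getter_steps(heap, service):
    """Assumed getter shape: read the shared pointer, then dereference it."""
    ptr = service["ticket"]            # step 1: read shared pointer, no lock
    yield
    heap.read(ptr)                     # step 2: use the (possibly stale) pointer


def run(schedule, locked=False):
    """Run one step of the named thread per schedule entry; return heap events."""
    heap = ToyHeap()
    service = {"ticket": heap.alloc()}
    threads = {
        "A": setter_steps(heap, service),
        "B": setter_steps(heap, service),
        "R": getter_steps(heap, service),
    }
    if locked:
        # Crude stand-in for a mutex: each thread's steps run back to back.
        schedule = sorted(schedule)
    for name in schedule:
        next(threads[name], None)      # advance that thread by one step
    return heap.events


print(run("ABABAB"))               # two setters interleaved
print(run("RAAAR"))                # a setter runs between a getter's read and use
print(run("ABABAB", locked=True))  # same work, serialized
```

In this model the interleaved setters both free the same handle, and the getter dereferences a handle that a setter freed in between; serializing the same steps produces no events. On a real service the interleaving is decided by the scheduler, not by a fixed string.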

By paperdive.ai