Agent Mode AI

The three questions every CIO should ask about a vendor accuracy claim



Episode 9 of Agent Mode AI. Abby and Avery walk through AM-146, the claim that vendor "ready-to-run" positioning without a named task, a named baseline, and a named methodology is procurement-deck noise rather than procurement evidence. The procurement-grade reference shapes in 2026 are the academic-benchmark layer (CRMArena-Pro 35% multi-step reliability, CMU TheAgentCompany 30-35% reproduction range, WebArena ~36% browser-agent ceiling, SWE-bench Verified for code generation) and the Anthropic Claude for Chrome disclosure pattern (23.6% prompt-injection attack success rate pre-mitigation, 11.2% post-mitigation, 0% on URL-injection variants after patches). A third class sits alongside these two: the named-customer audited deployment, with McKinsey Lilli, JPMorgan, BT Now Assist, and UK Government Digital Service as the canonical references.
Sources cited:
- CRMArena-Pro paper, Salesforce AI Research, August 2025
- Carnegie Mellon TheAgentCompany academic benchmark
- WebArena academic benchmark
- SWE-bench Verified
- Anthropic published security disclosure on Claude for Chrome, 26 August 2025
- McKinsey internal Lilli platform deployment data
- JPMorgan Chase 2023 AI value disclosure
- BT Now Assist deployment, Hena Jalil
- UK Government Digital Service Q4 2024
Claims tracked:
- AM-146 — Three accuracy-disclosure questions for procurement — agentmodeai.com/holding/?claim=AM-146
- AM-009 — Claude for Chrome procurement-grade disclosure pattern — agentmodeai.com/holding/?claim=AM-009
- AM-140 — Procurement-committee pre-pilot questions — agentmodeai.com/holding/?claim=AM-140
Newsletter and the full Holding-up ledger: agentmodeai.com

By Agent Mode AI