May 10, 2026

The three questions every CIO should ask about a vendor accuracy claim

12 minutes

Episode 9 of Agent Mode AI. Abby and Avery walk through AM-146, the claim that vendor "ready-to-run" positioning is procurement-deck noise rather than procurement evidence unless it names a task, a baseline, and a methodology. Two reference shapes count as procurement-grade in 2026. The first is the academic-benchmark layer: CRMArena-Pro (35% multi-step reliability), CMU TheAgentCompany (30-35% reproduction range), WebArena (~36% browser-agent ceiling), and SWE-bench Verified for code generation. The second is the Anthropic Claude for Chrome disclosure pattern, which reports attack success rates of 23.6% pre-mitigation, 11.2% post-mitigation, and 0% on URL-injection variants after patches. A third class sits alongside them: the named-customer audited deployment, with McKinsey Lilli, JPMorgan, BT Now Assist, and UK Government Digital Service as the canonical references.

Sources cited:

- CRMArena-Pro paper, Salesforce AI Research, August 2025

- Carnegie Mellon TheAgentCompany academic benchmark

- WebArena academic benchmark

- SWE-bench Verified

- Anthropic published security disclosure on Claude for Chrome, 26 August 2025

- McKinsey internal Lilli platform deployment data

- JPMorgan Chase 2023 AI value disclosure

- BT Now Assist deployment, Hena Jalil

- UK Government Digital Service Q4 2024

Claims tracked:

- AM-146 — Three accuracy-disclosure questions for procurement — agentmodeai.com/holding/?claim=AM-146

- AM-009 — Claude for Chrome procurement-grade disclosure pattern — agentmodeai.com/holding/?claim=AM-009

- AM-140 — Procurement-committee pre-pilot questions — agentmodeai.com/holding/?claim=AM-140

Newsletter and the full Holding-up ledger: agentmodeai.com

By Agent Mode AI
