November 26, 2025

The Evaluation Crisis

15 minutes

"Passes the bar exam" doesn't mean an AI can practice law. "Beats humans on ImageNet" doesn't mean it understands images. You'll learn why most AI benchmarks are fundamentally broken through the cautionary tale of the "infant morality study": researchers thought babies preferred moral helpers, but the babies just liked bouncing balls. The Clever Hans effect is alive and well in 2025. If you're evaluating AI products, making purchasing decisions, or relying on AI benchmark claims, this episode gives you the critical thinking tools to cut through the nonsense.

Topics Covered

- Construct validity: Are we testing what we think we're testing?

- The anthropomorphism trap: projecting human limitations onto AI

- Why "passing the bar exam" doesn't mean AI can practice law

- The Clever Hans problem in modern AI

- EU AI Act and regulatory approaches

- Testing AI like we test babies and animals (alien intelligence framework)
