NeurIPS 2025 by Basis Set

The Evaluation Crisis



"Passes the bar exam" doesn't mean AI can practice law. "Beats humans on ImageNet" doesn't mean it understands images. You'll learn why most AI benchmarks are fundamentally broken through the cautionary tale of the "infant morality study": researchers thought babies preferred moral helpers, but the babies just liked bouncing balls. The Clever Hans effect is alive and well in 2025. If you're evaluating AI products, making purchasing decisions, or relying on AI benchmark claims, this episode gives you the critical thinking tools to cut through the nonsense.
Topics Covered
- Construct validity: Are we testing what we think we're testing?
- The anthropomorphism trap: projecting human limitations onto AI
- Why "passing the bar exam" doesn't mean AI can practice law
- The Clever Hans problem in modern AI
- EU AI Act and regulatory approaches
- Testing AI like we test babies and animals (alien intelligence framework)
