December 08, 2025

Beyond the Benchmark - How do we test a 'superhuman' doctor?

6 minutes

Reference: Gallifant, J. & Bitterman, D.S. (2025). Humanity’s Next Medical Exam: Preparing to Evaluate Superhuman Systems. NEJM AI, 2(11). DOI: 10.1056/AIe2501008

When an AI scores 100% on a medical exam but can't navigate a hospital ward, is it really a doctor?

Today, we break down a new editorial from NEJM AI by Gallifant and Bitterman. We explore the transition from "recall" to "reasoning" and why the future of AI safety lies in "Interactive Interrogation" and high-fidelity sandboxes.

The models are becoming superhuman. It’s time our tests caught up.

Further recommended listening: https://www.youtube.com/watch?v=yQLOicn2vPU

#ai in medicine

Music generated by Mubert https://mubert.com/render



By Stephen Auger
