July 08, 2025

903: LLM Benchmarks Are Lying to You (And What to Do Instead), with Sinan Ozdemir

1 hour 28 minutes

Has AI benchmarking reached its limit, and what do we have to fill this gap? Sinan Ozdemir speaks to Jon Krohn about the lack of transparency in training data and the necessity of human-led quality assurance to detect AI hallucinations, when and why to be skeptical of AI benchmarks, and the future of benchmarking agentic and multimodal models.

Additional materials: www.superdatascience.com/903

This episode is brought to you by Trainium2, the latest AI chip from AWS, by Adverity, the conversational analytics platform, and by the Dell AI Factory with NVIDIA.

Interested in sponsoring a SuperDataScience Podcast episode? Email [email protected] for sponsorship information.

In this episode you will learn:

(16:48) Sinan’s new podcast, Practically Intelligent

(21:54) What to know about the limits of AI benchmarking

(53:22) Alternatives to AI benchmarks

(1:01:23) The difficulties in getting a model to recognize its mistakes

Super Data Science: ML & AI Podcast with Jon Krohn

By Jon Krohn

4.6

295 ratings

More shows like Super Data Science: ML & AI Podcast with Jon Krohn

Data Skeptic by Kyle Polich (479 listeners)
Software Engineering Daily by Software Engineering Daily (624 listeners)
Talk Python To Me by Michael Kennedy (585 listeners)
NVIDIA AI Podcast by NVIDIA (332 listeners)
AI Today Podcast by AI & Data Today (152 listeners)
DataFramed by DataCamp (269 listeners)
Practical AI by Practical AI LLC (210 listeners)
The Real Python Podcast by Real Python (142 listeners)
Machine Learning Street Talk (MLST) by Machine Learning Street Talk (MLST) (95 listeners)
No Priors: Artificial Intelligence | Technology | Startups by Conviction (135 listeners)
AI Chat: ChatGPT, AI News, Artificial Intelligence, OpenAI, Machine Learning by Jaeden Schafer (152 listeners)
This Day in AI Podcast by Michael Sharkey, Chris Sharkey (225 listeners)
The AI Daily Brief: Artificial Intelligence News and Analysis by Nathaniel Whittemore (607 listeners)
AI For Humans: Making Artificial Intelligence Fun & Practical by Kevin Pereira & Gavin Purcell (272 listeners)
Training Data by Sequoia Capital (39 listeners)