Machine Learning Street Talk (MLST)

Are AI Benchmarks Telling The Full Story? [SPONSORED] (Andrew Gordon and Nora Petrova - Prolific)



Is a car that wins a Formula 1 race the best choice for your morning commute? Probably not. In this sponsored deep dive with Prolific, we explore why the same logic applies to Artificial Intelligence. While models are currently shattering records on technical exams, they often fail the most important test of all: **the human experience.**


Why High Benchmark Scores Don’t Mean Better AI


Joining us are **Andrew Gordon** (Staff Researcher in Behavioral Science) and **Nora Petrova** (AI Researcher) from **Prolific**. They reveal the hidden flaws in how we currently rank AI and introduce a more rigorous, "humane" way to measure whether these models are actually helpful, safe, and relatable for real people.


---


Key Insights in This Episode:


* *The F1 Car Analogy:* Andrew explains why a model that excels at "Humanity's Last Exam" might be a nightmare for daily use. Technical benchmarks often ignore the nuances of human communication and adaptability.

* *The "Wild West" of AI Safety:* As users turn to AI for sensitive topics like mental health, Nora highlights the alarming lack of oversight and the "thin veneer" of safety training—citing recent controversial incidents like Grok-3’s "Mecha Hitler."

* *Fixing the "Leaderboard Illusion":* The team critiques current popular rankings like Chatbot Arena, discussing how anonymous, unstratified voting can lead to biased results and how companies can "game" the system.

* *The Xbox Secret to AI Ranking:* Discover how Prolific uses *TrueSkill*—the same algorithm Microsoft developed for Xbox Live matchmaking—to create a fairer, more statistically sound leaderboard for LLMs.

* *The Personality Gap:* Early data from the **HUMAINE Leaderboard** suggests that while AI is getting smarter, it is actually performing *worse* on metrics like personality, culture, and "sycophancy" (the tendency for models to become annoying "people-pleasers").
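
For listeners curious how a TrueSkill-style ranking turns head-to-head votes into scores, here is a minimal Python sketch of the standard two-player, no-draw update. It is an illustration only, not Prolific's implementation: the function name `update`, the default constants, and the omission of draws and of the dynamics term are all assumptions made to keep the example short. Each model carries a mean skill `mu` and an uncertainty `sigma`; a win shifts the means apart and shrinks both uncertainties.

```python
from math import sqrt
from statistics import NormalDist

# Conventional TrueSkill defaults (assumed here, not taken from the episode).
MU0 = 25.0
SIGMA0 = MU0 / 3
BETA = SIGMA0 / 2

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def update(winner, loser, beta=BETA):
    """Return updated (mu, sigma) pairs after `winner` beats `loser`.

    Simplified sketch: no draws and no dynamics (tau) term.
    """
    mu_w, sigma_w = winner
    mu_l, sigma_l = loser

    # Total uncertainty of the performance difference.
    c = sqrt(2 * beta**2 + sigma_w**2 + sigma_l**2)
    t = (mu_w - mu_l) / c

    # Correction factors for a truncated Gaussian (win, no draw margin).
    v = _STD_NORMAL.pdf(t) / _STD_NORMAL.cdf(t)
    w = v * (v + t)

    new_winner = (
        mu_w + sigma_w**2 / c * v,
        sigma_w * sqrt(1 - sigma_w**2 / c**2 * w),
    )
    new_loser = (
        mu_l - sigma_l**2 / c * v,
        sigma_l * sqrt(1 - sigma_l**2 / c**2 * w),
    )
    return new_winner, new_loser


# Two hypothetical models start with identical priors; model_a wins one vote.
model_a, model_b = update((MU0, SIGMA0), (MU0, SIGMA0))
```

Because uncertainty is tracked explicitly, a leaderboard built this way can direct new comparisons toward the model pairs whose ordering is still unclear, which is the "efficient sampling" idea mentioned in the timestamps.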


---


About the HUMAINE Leaderboard

Moving beyond simple "A vs. B" testing, the researchers discuss their new framework that samples participants based on *census data* (Age, Ethnicity, Political Alignment). By using a representative sample of the general public rather than just tech enthusiasts, they are building a standard that reflects the values of the real world.
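
As a rough illustration of census-based sampling, the Python sketch below splits a fixed number of participant slots across groups in proportion to their population counts, using largest-remainder rounding so the quotas add up exactly. The function name `allocate_quotas` and the age-band counts are hypothetical placeholders, not real census figures or Prolific's actual procedure.

```python
def allocate_quotas(population_counts, sample_size):
    """Split `sample_size` slots across groups proportionally to their counts."""
    total = sum(population_counts.values())
    quotas = {}
    remainders = {}
    for group, count in population_counts.items():
        # Integer share of the sample, plus the remainder left over.
        quotas[group], remainders[group] = divmod(sample_size * count, total)

    # Hand out any unassigned slots to the groups with the largest remainders.
    leftover = sample_size - sum(quotas.values())
    for group in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        quotas[group] += 1
    return quotas


# Hypothetical age-band counts (in thousands), for illustration only.
age_counts = {"18-34": 300, "35-54": 330, "55+": 370}
quotas = allocate_quotas(age_counts, 500)
```

The same allocation can be repeated for each stratification variable the episode mentions (age, ethnicity, political alignment), or applied to their combinations if the cross-tabulated counts are available.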


*Are we building models for benchmarks, or are we building them for humans? It’s time to change the scoreboard.*


Rescript link:

https://app.rescript.info/public/share/IDqwjY9Q43S22qSgL5EkWGFymJwZ3SVxvrfpgHZLXQc


---

TIMESTAMPS:

00:00:00 Introduction & The Benchmarking Problem

00:01:58 The Fractured State of AI Evaluation

00:03:54 AI Safety & Interpretability

00:05:45 Bias in Chatbot Arena

00:06:45 Prolific's Three Pillars Approach

00:09:01 TrueSkill Ranking & Efficient Sampling

00:12:04 Census-Based Representative Sampling

00:13:00 Key Findings: Culture, Personality & Sycophancy


---

REFERENCES:

Paper:

[00:00:15] MMLU

https://arxiv.org/abs/2009.03300

[00:05:10] Constitutional AI

https://arxiv.org/abs/2212.08073

[00:06:45] The Leaderboard Illusion

https://arxiv.org/abs/2504.20879

[00:09:41] HUMAINE Framework Paper

https://huggingface.co/blog/ProlificAI/humaine-framework

Company:

[00:00:30] Prolific

https://www.prolific.com

[00:01:45] Chatbot Arena

https://lmarena.ai/

Person:

[00:00:35] Andrew Gordon

https://www.linkedin.com/in/andrew-gordon-03879919a/

[00:00:45] Nora Petrova

https://www.linkedin.com/in/nora-petrova/

Algorithm:

[00:09:01] Microsoft TrueSkill

https://www.microsoft.com/en-us/research/project/trueskill-ranking-system/

Leaderboard:

[00:09:21] Prolific HUMAINE Leaderboard

https://www.prolific.com/humaine

[00:09:31] HUMAINE HuggingFace Space

https://huggingface.co/spaces/ProlificAI/humaine-leaderboard

[00:10:21] Prolific AI Leaderboard Portal

https://www.prolific.com/leaderboard

Dataset:

[00:09:51] Prolific Social Reasoning RLHF Dataset

https://huggingface.co/datasets/ProlificAI/social-reasoning-rlhf

Organization:

[00:10:31] MLCommons

https://mlcommons.org/
