June 21, 2025

Ep.10 Are benchmarks broken?

56 minutes

In this episode, we’re lucky to be joined by Alexandre Sallinen and Tony O’Halloran from the Laboratory for Intelligent Global Health & Humanitarian Response Technologies to discuss how large language models are assessed, including their Massive Open Online Validation & Evaluation (MOOVE) initiative.

0:25 - Technical wrap: what are agents?

13:20 - What are benchmarks?

18:20 - Automated evaluation

20:10 - Benchmarks

37:45 - Human feedback

44:50 - LLM as judge

Read more about the projects we discuss here:

Meditron

Learn about MOOVE, or contact our team if you'd like to be involved

Listen to the LiGHTCAST, including their excellent recent outline of the HealthBench paper

More details in the show notes on our website.

Episodes | Bluesky | [email protected]

By Medical Attention
