Medical Attention

Ep.10 Are benchmarks broken?



In this episode, we’re lucky to be joined by Alexandre Sallinen and Tony O’Halloran from the Laboratory for Intelligent Global Health & Humanitarian Response Technologies to discuss how large language models are assessed, including their Massive Open Online Validation & Evaluation (MOOVE) initiative.

0:25 - Technical wrap: what are agents?

13:20 - What are benchmarks?

  • 18:20 - Automated evaluation

  • 20:10 - Benchmarks

  • 37:45 - Human feedback

  • 44:50 - LLM as judge

Read more about the projects we discuss here:

  • Meditron

  • Learn about MOOVE, or contact our team if you'd like to be involved

  • Listen to the LiGHTCAST, including their recent excellent outline of the HealthBench paper

More details in the show notes on our website.

Episodes | Bluesky | [email protected]


By Medical Attention