April 18, 2025

LibreEval: The Largest Open Source Benchmark for RAG Hallucination Detection

27 minutes

For this week's paper read, we dive into our own research.

We wanted to create a replicable, evolving dataset that can keep pace with model training, so that you always know you're testing with data your model has never seen before. We also saw the prohibitively high cost of running LLM evals at scale, so we used our data to fine-tune a series of SLMs that perform just as well as their base LLM counterparts at one-tenth of the cost.

So, over the past few weeks, the Arize team generated the largest public dataset of hallucinations, as well as a series of fine-tuned evaluation models.

We talk about what we built, the process we followed, and the bottom-line results. You can read the recap of LibreEval here. Dive into the research, or sign up to join us next time.

Learn more about AI observability and evaluation, join the Arize AI Slack community or get the latest on LinkedIn and X.

Deep Papers

By Arize AI
