January 23, 2025

AI testing, benchmarks and evals

36 minutes

Generative AI's popularity has led to a renewed interest in quality assurance, perhaps unsurprising given the inherent unpredictability of the technology. Over the last year, a number of techniques and approaches have emerged in response, including evals, benchmarking and guardrails. These terms refer to different things, but they share an aim: improving the reliability and accuracy of generative AI.

To discuss these techniques and the renewed enthusiasm for testing across the industry, host Lilly Ryan is joined by Shayan Mohanty, Head of AI Research at Thoughtworks, and John Singleton, Program Manager for Thoughtworks' AI Lab. They discuss the differences between evals, benchmarking and testing and explore both what they mean for businesses venturing into generative AI and how they can be implemented effectively.

Learn more about evals, benchmarks and testing in this blog post by Shayan and John (written with Parag Mahajani): https://www.thoughtworks.com/insights/blog/generative-ai/LLM-benchmarks,-evals,-and-tests

Thoughtworks Technology Podcast

By Thoughtworks
