Lenny's Reads

Listen: Building eval systems that improve your AI product

If you’re a premium subscriber, add the private feed to your podcast app at https://add.lennysreads.com

In this episode, we dive into the fast-emerging discipline of AI evaluation with Hamel Husain and Shreya Shankar, creators of AI Evals for Engineers & PMs, the #1 highest-grossing course on Maven.

After training 2,000+ PMs and engineers across 500+ companies, Hamel and Shreya reveal the complete playbook for building evaluations that actually improve your AI product: moving beyond vanity dashboards to a system that drives continuous improvement.

In this episode, you’ll learn:

• Why most AI eval dashboards fail to deliver real product improvements

• How to use error analysis to uncover your product’s most critical failure modes

• The role of a “principal domain expert” in setting a consistent quality bar

• Techniques for transforming messy error notes into a clean taxonomy of failures

• When to use code-based checks vs. LLM-as-a-judge evaluators

• How to build trust in your evals with human-labeled ground-truth datasets

• Why binary pass/fail labels outperform Likert scales in practice

• Evaluation strategies for complex systems: multi-turn conversations, RAG pipelines, and agentic workflows

• How CI safety nets and production monitoring work together to create a flywheel of continuous product improvement

References:

• Read the newsletter: https://www.lennysnewsletter.com/p/building-eval-systems-that-improve

• AI Evals for Engineers & PMs: https://maven.com/parlance-labs/evals

• A Field Guide to Rapidly Improving AI Products: https://hamel.dev/blog/posts/field-guide/

• Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences: https://arxiv.org/abs/2404.12272

• Aman Khan: https://www.linkedin.com/in/amanberkeley/

• Anthropic: https://www.anthropic.com/

• Arize Phoenix: https://phoenix.arize.com/

• Braintrust: https://www.braintrust.dev/

• Beyond vibe checks: A PM’s complete guide to evals: https://www.lennysnewsletter.com/p/beyond-vibe-checks-a-pms-complete

• Frequently Asked Questions (And Answers) About AI Evals: https://hamel.dev/blog/posts/evals-faq/

• Hamel Husain: https://www.linkedin.com/in/hamelhusain/

• LangSmith: https://smith.langchain.com/

• Not Dead Yet: On RAG: https://hamel.dev/notes/llm/rag/not_dead.html

• OpenAI: https://openai.com/

• Shreya Shankar: https://www.linkedin.com/in/shrshnk/

Listen:

• YouTube: https://www.youtube.com/@lennysreads

• Apple: https://podcasts.apple.com/us/podcast/lennys-reads/id1810314693

• Spotify: https://open.spotify.com/show/0IIunA06qMtrcQLfypTooj

• Newsletter: https://www.lennysnewsletter.com/subscribe

Follow Lenny:

• Twitter/X: https://twitter.com/lennysan

• LinkedIn: https://www.linkedin.com/in/lennyrachitsky/

• Podcast: https://www.youtube.com/@lennyspodcast

Subscribe

• Substack: https://lennysreads.com/

About

Welcome to Lenny's Reads, where every week you’ll find a fresh audio version of my newsletter about building product, driving growth, and accelerating your career, read to you by the soothing voice of Lennybot.

To hear more, visit www.lennysnewsletter.com

Lenny's Reads by Lenny Rachitsky

4.3 (6 ratings)