Lenny's Reads

Building eval systems that improve your AI product


If you’re a premium subscriber, add the private feed to your podcast app at https://add.lennysreads.com

In this episode, we dive into the fast-emerging discipline of AI evaluation with Hamel Husain and Shreya Shankar, creators of AI Evals for Engineers & PMs, the #1 highest-grossing course on Maven.

After training 2,000+ PMs and engineers across 500+ companies, Hamel and Shreya reveal the complete playbook for building evaluations that actually improve your AI product, moving beyond vanity dashboards to a system that drives continuous improvement.

In this episode, you’ll learn:

• Why most AI eval dashboards fail to deliver real product improvements

• How to use error analysis to uncover your product’s most critical failure modes

• The role of a “principal domain expert” in setting a consistent quality bar

• Techniques for transforming messy error notes into a clean taxonomy of failures

• When to use code-based checks vs. LLM-as-a-judge evaluators

• How to build trust in your evals with human-labeled ground-truth datasets

• Why binary pass/fail labels outperform Likert scales in practice

• Evaluation strategies for complex systems: multi-turn conversations, RAG pipelines, and agentic workflows

• How CI safety nets and production monitoring work together to create a flywheel of continuous product improvement
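As a rough illustration of three of the topics above (code-based checks, binary pass/fail labels, and human-labeled ground truth), here is a minimal Python sketch. It is not taken from the episode or the course: the `LabeledExample` type, the `has_required_phrase` rule, and the choice of true-positive and true-negative rates as the agreement measure are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class LabeledExample:
    output: str        # the AI product's response
    human_pass: bool   # ground-truth pass/fail label from a human reviewer
    judge_pass: bool   # pass/fail label from the automated evaluator


def has_required_phrase(output: str, phrase: str = "not financial advice") -> bool:
    """Code-based check: a deterministic rule that needs no LLM (hypothetical rule)."""
    return phrase in output.lower()


def agreement_rates(examples: list[LabeledExample]) -> tuple[float, float]:
    """Return (true-positive rate, true-negative rate) of the evaluator vs. human labels."""
    tp = sum(e.human_pass and e.judge_pass for e in examples)
    fn = sum(e.human_pass and not e.judge_pass for e in examples)
    tn = sum(not e.human_pass and not e.judge_pass for e in examples)
    fp = sum(not e.human_pass and e.judge_pass for e in examples)
    # Guard against empty classes so the sketch never divides by zero.
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return tpr, tnr
```

In this sketch the evaluator's labels could come from a rule like `has_required_phrase` or from an LLM judge; either way, comparing them against human labels shows how often the automated evaluator agrees on passes and on failures separately.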

References:

• Read the newsletter: https://www.lennysnewsletter.com/p/building-eval-systems-that-improve

• AI Evals for Engineers & PMs: https://maven.com/parlance-labs/evals

• A Field Guide to Rapidly Improving AI Products: https://hamel.dev/blog/posts/field-guide/

• Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences: https://arxiv.org/abs/2404.12272

• Aman Khan: https://www.linkedin.com/in/amanberkeley/

• Anthropic: https://www.anthropic.com/

• Arize Phoenix: https://phoenix.arize.com/

• Braintrust: https://www.braintrust.dev/

• Beyond vibe checks: A PM’s complete guide to evals: https://www.lennysnewsletter.com/p/beyond-vibe-checks-a-pms-complete

• Frequently Asked Questions (And Answers) About AI Evals: https://hamel.dev/blog/posts/evals-faq/

• Hamel Husain: https://www.linkedin.com/in/hamelhusain/

• LangSmith: https://smith.langchain.com/

• Not Dead Yet: On RAG: https://hamel.dev/notes/llm/rag/not_dead.html

• OpenAI: https://openai.com/

• Shreya Shankar: https://www.linkedin.com/in/shrshnk/

Listen:

• YouTube: https://www.youtube.com/@lennysreads

• Apple: https://podcasts.apple.com/us/podcast/lennys-reads/id1810314693

• Spotify: https://open.spotify.com/show/0IIunA06qMtrcQLfypTooj

• Newsletter: https://www.lennysnewsletter.com/subscribe

Follow Lenny:

• Twitter/X: https://twitter.com/lennysan

• LinkedIn: https://www.linkedin.com/in/lennyrachitsky/

• Podcast: https://www.youtube.com/@lennyspodcast

Subscribe

• Substack: https://lennysreads.com/

About

Welcome to Lenny's Reads, where every week you’ll find a fresh audio version of my newsletter about building product, driving growth, and accelerating your career, read to you by the soothing voice of Lennybot.



To hear more, visit www.lennysnewsletter.com

Lenny's Reads, by Lenny Rachitsky
