Lenny's Podcast: Product | Career | Growth

Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar (creators of the #1 eval course)


Listen Later

Hamel Husain and Shreya Shankar teach the world’s most popular course on AI evals and have trained over 2,000 PMs and engineers (including many teams at OpenAI and Anthropic). In this conversation, they demystify the process of developing effective evals, walk through real examples, and share practical techniques that’ll help you improve your AI product.

What you’ll learn:

1. WTF evals are

2. Why they’ve become the most important new skill for AI product builders

3. A step-by-step walkthrough of how to create an effective eval

4. A deep dive into error analysis, open coding, and axial coding

5. Code-based evals vs. LLM-as-judge

6. The most common pitfalls and how to avoid them

7. Practical tips for implementing evals with minimal time investment (30 minutes per week after initial setup)

8. Insight into the debate between “vibes” and systematic evals

Brought to you by:

Fin—The #1 AI agent for customer service

Dscout—The UX platform to capture insights at every stage: from ideation to production

Mercury—The art of simplified finances

Where to find Shreya Shankar

• X: https://x.com/sh_reya

• LinkedIn: https://www.linkedin.com/in/shrshnk/

• Website: https://www.sh-reya.com/

• Maven course: https://bit.ly/4myp27m

Where to find Hamel Husain

• X: https://x.com/HamelHusain

• LinkedIn: https://www.linkedin.com/in/hamelhusain/

• Website: https://hamel.dev/

• Maven course: https://bit.ly/4myp27m

In this episode, we cover:

(00:00) Introduction to Hamel and Shreya

(04:57) What are evals?

(09:56) Demo: Examining real traces from a property management AI assistant

(16:51) Writing notes on errors

(23:54) Why LLMs can’t replace humans in the initial error analysis

(25:16) The concept of a “benevolent dictator” in the eval process

(28:07) Theoretical saturation: when to stop

(31:39) Using axial codes to help categorize and synthesize error notes

(44:39) The results

(46:06) Building an LLM-as-judge to evaluate specific failure modes

(48:31) The difference between code-based evals and LLM-as-judge

(52:10) Example: LLM-as-judge

(54:45) Testing your LLM judge against human judgment

(01:00:51) Why evals are the new PRDs for AI products

(01:05:09) How many evals you actually need

(01:07:41) What comes after evals

(01:09:57) The great evals debate

(1:15:15) Why dogfooding isn’t enough for most AI products

(01:18:23) OpenAI’s Statsig acquisition

(1:23:02) The Claude Code controversy and the importance of context

(01:24:13) Common misconceptions around evals

(1:22:28) Tips and tricks for implementing evals effectively

(1:30:37) The time investment

(1:33:38) Overview of their comprehensive evals course

(1:37:57) Lightning round and final thoughts

LLM Log Open Codes Analysis Prompt:

Please analyze the following CSV file. There is a metadata field which has an nested field called z_note that contains open codes for analysis of LLM logs that we are conducting. Please extract all of the different open codes. From the _note field, propose 5-6 categories that we can create axial codes from.

Referenced:

• Building eval systems that improve your AI product: https://www.lennysnewsletter.com/p/building-eval-systems-that-improve

• Mercor: https://mercor.com/

• Brendan Foody on LinkedIn: https://www.linkedin.com/in/brendan-foody-2995ab10b

• Nurture Boss: https://nurtureboss.io/

• Braintrust: https://www.braintrust.dev/

• Andrew Ng on X: https://x.com/andrewyng

• Carrying Out Error Analysis: https://www.youtube.com/watch?v=JoAxZsdw_3w

• Julius AI: https://julius.ai/

• Brendan Foody on X—“evals are the new PRDs”: https://x.com/BrendanFoody/status/1939764763485171948

• Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences: https://dl.acm.org/doi/abs/10.1145/3654777.3676450

• Lenny’s post on X about evals: https://x.com/lennysan/status/1909636749103599729

• Statsig: https://statsig.com/

• Claude Code: https://www.anthropic.com/claude-code

• Cursor: https://cursor.com/

• Occam’s razor: https://en.wikipedia.org/wiki/Occam%27s_razor

Frozen: https://www.imdb.com/title/tt2294629/

The Wire on HBO: https://en.wikipedia.org/wiki/The_Wire

Recommended books:

Pachinko: https://www.amazon.com/Pachinko-National-Book-Award-Finalist/dp/1455563935

Apple in China: The Capture of the World’s Greatest Company: https://www.amazon.com/Apple-China-Capture-Greatest-Company/dp/1668053373/

Machine Learning: https://www.amazon.com/Machine-Learning-Tom-M-Mitchell/dp/1259096955

Artificial Intelligence: A Modern Approach: https://www.amazon.com/Artificial-Intelligence-Modern-Approach-Global/dp/1292401133/

Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email [email protected].

Lenny may be an investor in the companies discussed.

My biggest takeaways from this conversation:



To hear more, visit www.lennysnewsletter.com
...more
View all episodesView all episodes
Download on the App Store

Lenny's Podcast: Product | Career | GrowthBy Lenny Rachitsky

  • 4.9
  • 4.9
  • 4.9
  • 4.9
  • 4.9

4.9

1,316 ratings


More shows like Lenny's Podcast: Product | Career | Growth

View all
a16z Podcast by Andreessen Horowitz

a16z Podcast

1,080 Listeners

The Twenty Minute VC (20VC): Venture Capital | Startup Funding | The Pitch by Harry Stebbings

The Twenty Minute VC (20VC): Venture Capital | Startup Funding | The Pitch

527 Listeners

The Knowledge Project by Shane Parrish

The Knowledge Project

2,680 Listeners

The Official SaaStr Podcast: SaaS | Founders | Investors by SaaStr

The Official SaaStr Podcast: SaaS | Founders | Investors

173 Listeners

Masters of Scale by WaitWhat

Masters of Scale

3,981 Listeners

Y Combinator Startup Podcast by Y Combinator

Y Combinator Startup Podcast

221 Listeners

Product Thinking by Melissa Perri

Product Thinking

143 Listeners

The Startup Ideas Podcast by Greg Isenberg

The Startup Ideas Podcast

203 Listeners

The Logan Bartlett Show by by Redpoint Ventures

The Logan Bartlett Show

188 Listeners

No Priors: Artificial Intelligence | Technology | Startups by Conviction

No Priors: Artificial Intelligence | Technology | Startups

130 Listeners

Latent Space: The AI Engineer Podcast by swyx + Alessio

Latent Space: The AI Engineer Podcast

96 Listeners

BG2Pod with Brad Gerstner and Bill Gurley by BG2Pod

BG2Pod with Brad Gerstner and Bill Gurley

482 Listeners

Lightcone Podcast by Y Combinator

Lightcone Podcast

17 Listeners

Training Data by Sequoia Capital

Training Data

40 Listeners

Uncapped with Jack Altman by Alt Capital

Uncapped with Jack Altman

41 Listeners