The Growth Podcast

How to Do AI Evals Step-by-Step with Real Production Data | Tutorial by Hamel Husain and Shreya Shankar


Today’s Episode

Everyone’s demoing AI features. Few are shipping them to production reliably.

The gap? Evals.

Not the theoretical kind. The real-world kind that catches bugs before users do.

Hamel Husain and Shreya Shankar train people at OpenAI, Anthropic, Google, and Meta on how to build AI products that actually work. Their Maven course is the top-grossing course on the platform.

Today, they’re walking you through their complete eval process.

----

Brought to you by:

* The AI Evals Course for PMs & Engineers: You get $800 off with this link

* Vanta: Automate compliance. Get $1,000 with my link

* Jira Product Discovery: Plan with purpose, ship with confidence

* Land PM job: 12-week experience to master getting a PM job

* Pendo: the #1 Software Experience Management Platform

----

If you want access to my AI tool stack - Dovetail, Arize, Linear, Descript, Reforge Build, DeepSky, Relay.app, Magic Patterns, and Mobbin - for free, grab Aakash’s bundle.

Are you searching for a PM job? Join me + 29 others for an intensive 12-week experience to master getting a PM job. Only 23 seats left.

----

Key Takeaways:

1. AI evals are the #1 most important new skill for PMs in 2025 - Even Claude Code teams do evals upstream. For custom applications, systematic evaluation is non-negotiable. Dogfooding alone isn't enough at scale.

2. Error analysis is the secret weapon most teams skip - Looking at 100 traces teaches you more than any generic metric. Hamel: "If you try to use helpfulness scores, the LLM won't catch the real product issues."

3. Use observability tools but don't depend on them completely - Braintrust, LangSmith, and Arize all work. But Shreya and Hamel teach students to vibe code their own trace viewers. Sometimes CSV files are enough to start.

4. Never use agreement as your eval metric - It's a trap. A judge that always says "pass" still agrees with human labels 90% of the time when only 10% of traces fail. Use TPR (true positive rate) and TNR (true negative rate) instead.

5. Open coding then axial coding reveals patterns - Write notes on 100 traces without root cause analysis. Then categorize into 5-6 actionable themes. Use LLMs to help but refine manually.

6. Product managers must do the error analysis themselves - Don't outsource to developers. Engineers lack domain context. Hamel: "It's almost a tragedy to separate the prompt from the product manager because it's English."

7. Real traces reveal what demos hide - ChatGPT said the assistant was correct but missed: wrong bathroom configuration, markdown in SMS, double-booked tours, ignored handoff requests.

8. Binary scores beat 1-5 scales for LLM judges - Easier to validate alignment. Business decisions are binary anyway. LLMs struggle with nuanced numerical scoring.

9. Code-based evals for formatting, LLM judges for subjective calls - Markdown in text messages? Write a simple assertion. Human handoff quality? You need an LLM judge with a proper rubric.

10. Start with traces even before launch - Dogfood your own app. Recruit friends as beta testers. Generate synthetic inputs only as a last resort. Error analysis works best with real user behavior.
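
Takeaways 4 and 9 can be made concrete with a short sketch. The Python below is illustrative only and is not taken from the episode: the function names, the markdown regex, and the choice to treat "pass" as the positive class are all assumptions. It shows a code-based check for markdown in a text message, and how an always-pass judge scores on agreement versus TPR and TNR when 10 of 100 human-labeled traces fail.

```python
import re

# Rough heuristic for markdown syntax (an assumption, not an exhaustive detector):
# bold markers, headings, bullet markers, links, and backticks.
MARKDOWN_PATTERN = re.compile(
    r"(\*\*|__|^#{1,6}\s|^\s*[-*]\s|\[[^\]]+\]\([^)]+\)|`)",
    re.MULTILINE,
)


def has_markdown(text):
    """Code-based eval: True if an outgoing SMS appears to contain markdown."""
    return bool(MARKDOWN_PATTERN.search(text))


def judge_alignment(human, judge):
    """Compare LLM-judge labels to human labels.

    Both arguments are equal-length lists of booleans where True means "pass".
    Treating "pass" as the positive class is an assumption of this sketch.
    """
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    true_pos = sum(h and j for h, j in zip(human, judge))
    true_neg = sum((not h) and (not j) for h, j in zip(human, judge))
    positives = sum(human)
    negatives = len(human) - positives
    return {
        "agreement": (true_pos + true_neg) / len(human),
        # Share of human-labeled passes the judge also passed.
        "tpr": true_pos / positives if positives else None,
        # Share of human-labeled failures the judge also failed.
        "tnr": true_neg / negatives if negatives else None,
    }


# Hypothetical labels: 90 passing traces, 10 failing traces.
human_labels = [True] * 90 + [False] * 10
always_pass_judge = [True] * 100

scores = judge_alignment(human_labels, always_pass_judge)
print(scores)  # agreement is 0.9, tpr is 1.0, tnr is 0.0
print(has_markdown("**Tour confirmed** for 3pm"))  # True
print(has_markdown("Tour confirmed for 3pm."))  # False
```

The always-pass judge looks fine on agreement, but its TNR of 0.0 shows it never catches a failure, which is the case the takeaway warns about.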

----

Where To Find Them

* LinkedIn:

* Hamel: Hamel’s LinkedIn

* Shreya: Shreya’s LinkedIn

* AI Evals Course: World’s best AI Evals Course (You get $800 off with this link)

Related Content

Newsletters:

* AI Evals

* AI PM Observability

* AI Testing

* LLM Judge

* AI Prototype to Production

* AI Product Strategy

* How to Build AI Products

* Prompt Engineering

Podcasts:

* AI Evals: Everything You Need to Know to Start

* Everything you need to know about AI (for PMs and builders)

* Carl Vellotti on Claude Code

* Marily Nika on Google AI PM Tool Stack

* Pawel Huryn on n8n for PMs

PS. Please subscribe on YouTube and follow on Apple & Spotify. It helps!



This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.news.aakashg.com/subscribe
The Growth Podcast, by Aakash Gupta
4.6

34 ratings