


If you’re a premium subscriber, add the private feed to your podcast app at https://add.lennysreads.com
In this episode, we dive into the fast-emerging discipline of AI evaluation with Hamel Husain and Shreya Shankar, creators of AI Evals for Engineers & PMs, the #1 highest-grossing course on Maven.
After training 2,000+ PMs and engineers across 500+ companies, Hamel and Shreya reveal the complete playbook for building evaluations that actually improve your AI product: moving beyond vanity dashboards to a system that drives continuous improvement.
In this episode, you’ll learn:
• Why most AI eval dashboards fail to deliver real product improvements
• How to use error analysis to uncover your product’s most critical failure modes
• The role of a “principal domain expert” in setting a consistent quality bar
• Techniques for transforming messy error notes into a clean taxonomy of failures
• When to use code-based checks vs. LLM-as-a-judge evaluators
• How to build trust in your evals with human-labeled ground-truth datasets
• Why binary pass/fail labels outperform Likert scales in practice
• Evaluation strategies for complex systems: multi-turn conversations, RAG pipelines, and agentic workflows
• How CI safety nets and production monitoring work together to create a flywheel of continuous product improvement
References:
• Read the newsletter: https://www.lennysnewsletter.com/p/building-eval-systems-that-improve
• AI Evals for Engineers & PMs: https://maven.com/parlance-labs/evals
• A Field Guide to Rapidly Improving AI Products: https://hamel.dev/blog/posts/field-guide/
• Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences: https://arxiv.org/abs/2404.12272
• Aman Khan: https://www.linkedin.com/in/amanberkeley/
• Anthropic: https://www.anthropic.com/
• Arize Phoenix: https://phoenix.arize.com/
• Braintrust: https://www.braintrust.dev/
• Beyond vibe checks: A PM’s complete guide to evals: https://www.lennysnewsletter.com/p/beyond-vibe-checks-a-pms-complete
• Frequently Asked Questions (And Answers) About AI Evals: https://hamel.dev/blog/posts/evals-faq/
• Hamel Husain: https://www.linkedin.com/in/hamelhusain/
• LangSmith: https://smith.langchain.com/
• Not Dead Yet: On RAG: https://hamel.dev/notes/llm/rag/not_dead.html
• OpenAI: https://openai.com/
• Shreya Shankar: https://www.linkedin.com/in/shrshnk/
Listen:
• YouTube: https://www.youtube.com/@lennysreads
• Apple: https://podcasts.apple.com/us/podcast/lennys-reads/id1810314693
• Spotify: https://open.spotify.com/show/0IIunA06qMtrcQLfypTooj
• Newsletter: https://www.lennysnewsletter.com/subscribe
Follow Lenny:
• Twitter/X: https://twitter.com/lennysan
• LinkedIn: https://www.linkedin.com/in/lennyrachitsky/
• Podcast: https://www.youtube.com/@lennyspodcast
Subscribe
• YouTube: https://www.youtube.com/@lennysreads
• Apple: https://podcasts.apple.com/us/podcast/lennys-reads/id1810314693
• Spotify: https://open.spotify.com/show/0IIunA06qMtrcQLfypTooj
• Substack: https://lennysreads.com/
About
Welcome to Lenny's Reads, where every week you’ll find a fresh audio version of my newsletter about building product, driving growth, and accelerating your career, read to you by the soothing voice of Lennybot.
By Lenny Rachitsky
Rating: 4.3 (66 ratings)
