July 08, 2025

Episode 53: Human-Seeded Evals & Self-Tuning Agents: Samuel Colvin on Shipping Reliable LLMs

44 minutes

Demos are easy; durability is hard. Samuel Colvin has spent a decade building guardrails in Python (first with Pydantic, now with Logfire), and he’s convinced most LLM failures have nothing to do with the model itself. They appear where the data is fuzzy, the prompts drift, or no one bothered to measure real-world behavior. Samuel joins me to show how a sprinkle of engineering discipline keeps those failures from ever reaching users.

We talk through:

• Tiny labels, big leverage: how five thumbs-ups/thumbs-downs are enough for Logfire to build a rubric that scores every call in real time

• Drift alarms, not dashboards: catching the moment your prompt or data shifts instead of reading charts after the fact

• Prompt self-repair: a prototype agent that rewrites its own system prompt—and tells you when it still doesn’t have what it needs

• The hidden cost curve: why the last 15 percent of reliability costs far more than the flashy 85 percent demo

• Business-first metrics: shipping features that meet real goals instead of chasing another decimal point of “accuracy”

If you’re past the proof-of-concept stage and staring down the “now it has to work” cliff, this episode is your climbing guide.

LINKS

Pydantic

Logfire

Upcoming Events on Luma

Hugo's recent newsletter about upcoming events and more!

🎓 Learn more:

Hugo's course: Building LLM Applications for Data Scientists and Software Engineers — next cohort starts July 8: https://maven.com/s/course/d56067f338

📺 Watch the video version on YouTube: YouTube link

Vanishing Gradients

By Hugo Bowne-Anderson
