What if the biggest mistake in AI development isn’t what you’re building, but how you’re measuring it? In this episode of ABCs for Building the Future, host Robert sits down with Hamel Husain, a machine learning engineer with 25+ years across GitHub, Airbnb, and his own consulting practice, and creator of the AI Evals course that’s trained nearly 50 OpenAI employees.
Hamel shares his evolution from pioneering code understanding models at GitHub to breaking free from corporate America — and makes a compelling case that we’re measuring AI products all wrong. We don’t need more automated eval vendors or hallucination scores. We need teams who know how to look at their data, count their failures, and iterate like investigative journalists.
If you’re a founder, engineer, or product leader drowning in eval tooling demos and wondering why your AI product still feels broken — this conversation is a masterclass in cutting through the hype, escaping the matrix of engineering elitism, and building products that actually work.
1. Error Analysis Before Automation: The 30-Minute Practice That Beats Every Vendor Tool
“The first step in doing eval is first understand what is wrong with your system. So doing some data analysis of your system to figure out what is broken.”
Hamel’s career-defining insight didn’t come from a research paper or a vendor pitch. It came from watching client after client get stuck in the same place: building AI products that “work” but don’t work well.
The pattern was identical every time. Teams would glue together RAG pipelines, add tool calling, get excited about the demo—then hit a wall. “How do we make it actually good?” they’d ask. And Hamel would ask back: “How are you measuring what’s improving?”
The breakthrough: Most teams jump straight to evals (or worse, eval vendors) without understanding their actual failure modes. They’re solving the wrong problem.
What actually works:
* Pull 100 traces from your production system
* Take notes on what breaks (be specific: “interrupts user mid-thought” not “bad UX”)
* Categorize your notes into failure types
* Count them
That’s it. No LLM judges, no automated hallucination scores, no vendor dashboards.
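To make that concrete, here’s a minimal sketch of the counting step in Python, assuming you’ve exported a sample of traces to a JSON file and added a free-text note plus a failure category to each one by hand (the file name and field names are illustrative, not a specific vendor format):

```python
import json
from collections import Counter

# Load ~100 traces exported from your production system
# (file name and schema are illustrative).
with open("traces_sample.json") as f:
    traces = json.load(f)

# Steps 1-2: read each trace and write a specific note about what broke.
# In practice this is manual work in a notebook or spreadsheet, e.g.:
# traces[17]["note"] = "interrupts user mid-thought before they finish typing"

# Step 3: map each open-ended note to a failure category you define yourself.
# traces[17]["category"] = "premature_response"

# Step 4: count the categories to see where to focus first.
counts = Counter(t.get("category", "uncategorized") for t in traces)
for category, n in counts.most_common():
    print(f"{category}: {n}")
```

The value isn’t in the code; it’s in forcing yourself to read every trace and put a name on each failure before reaching for any tooling.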
Reflection: Hamel shared a story about auditing a recruiting email platform. The AI-generated messages were generic LinkedIn spam: “Based on your background at Epistemic Me...” When he pointed out he’d immediately delete these, the team said “everything works.” But they weren’t actually trying to recruit anyone. They weren’t measuring conversion. They had convinced themselves the product worked because they hadn’t looked at the data.
If you’ve ever felt like your AI product is “almost there” but can’t articulate why it’s not—this 30-minute practice will tell you more than three months of tooling evaluation.
2. The Dogfooding Delusion: Why Most Teams Are Just Testing (Not Using)
“Are you dog fooding on that level? Where you are the expert. You are using it in anger all the time... Its output affects your livelihood.”
When Anthropic revealed that Claude Code’s success came partly from intensive internal use, Twitter exploded: “See, you don’t need evals!” But Hamel caught what everyone missed—a crucial distinction between real dogfooding and performance theater.
Real dogfooding (Anthropic engineers with Claude Code):
* They ship production code using it daily
* When it breaks, their work stops
* The output affects their livelihood
* They’re both the domain expert AND the user
Fake dogfooding (most teams):
* “Using” the product occasionally to “test it”
* Not actually trying to achieve the goal
* Wouldn’t pay for it themselves
* Building an AI health coach but not trying to lose weight with it
The gap is everything. In fake dogfooding, you convince yourself the product works because you’re not feeling the pain of it failing.
Reflection: Hamel’s litmus test is disarmingly simple: Would you pay for this with your own money? Does using it save or cost you real time? If not, you’re not dogfooding—you’re testing. And that’s a fundamentally different feedback loop that lets broken products feel functional far longer than they should.
The dangerous corollary: Everyone thinks they’re dogfooding. The teams building recruiting tools, fitness coaches, coding assistants—they all believe they’re using their products “in anger.” But ask them about conversion rates, about whether they’d recommend it to a friend, about whether it’s actually solving their problem—and the story changes.
3. Data Science as Investigative Journalism: The Mindset That Changes Everything
“Good data science in big businesses looks like investigative journalism. You’re figuring out, teasing out the story of what exactly happened here and what can I learn.”
This reframe completely shifted how I think about AI product work. It’s not about running queries or computing metrics—it’s about investigation.
The parallel:
Investigative journalist:
* Conducts interviews (qualitative insight)
* Analyzes public records and data (quantitative patterns)
* Connects dots others miss
* Synthesizes into a narrative that drives action
Effective AI product developer:
* Examines traces and user conversations (qualitative)
* Measures failure patterns and metrics (quantitative)
* Identifies root causes, not symptoms
* Communicates what’s broken and why in ways that mobilize teams
Why this matters: Most engineers are trained to dismiss storytelling as “soft skills” or “non-technical fluff.” But when you’re working with stochastic systems that can’t be reduced to deterministic tests, your ability to investigate, synthesize, and communicate becomes the core competency.
You can’t fix what you can’t explain. And you can’t explain what you haven’t investigated.
Reflection: Hamel describes how at companies experiencing margin leakage or churn spikes, the best data scientists operate exactly like reporters: they gather evidence from multiple sources (qualitative interviews, quantitative metrics), look for patterns, rule out false leads, and ultimately tell a coherent story about what happened and why it matters.
The person who can explain why your AI health coach sounds paternalistic is worth infinitely more than someone who can compute its “tone score.”
If you’re hiring for AI product roles, look for people who can tell stories with data—not just run statistical analyses.
4. Silicon Valley’s Status Games: Why Engineers Defend the Matrix (And How to Escape)
“Engineers will defend the matrix. They will defend the matrix to the death... You kind of have to decide, like you have to see that. If you want to get outside the matrix.”
This might be the most uncomfortable truth Hamel shared and the most liberating.
The Silicon Valley hierarchy (unspoken but real):
* VC-backed unicorn founder (peak status)
* Bootstrapped product founder
* Senior engineer at FAANG
* Course creator / educator (low status, sometimes openly mocked as “course bro”)
The absurd reality: Hamel’s first course on fine-tuning LLMs generated 5x his consulting income while he slept. More critically, it gave him leverage and freedom—the very things most engineers claim to want but are trained to dismiss.
“I would go to sleep and then wake up in the morning and be like, all these sales while I was sleeping. That totally rewired my brain. I can never go back to a job.”
Why this matters: Engineers are systematically taught to mock anything “non-technical”—marketing, communication, teaching, writing. If you’re good at explaining concepts or building community, you’re dismissed as “not a real engineer” or “the marketing guy.”
But these are exactly the skills that enable you to escape trading time for money.
The trap Hamel identifies: At companies, if you’re a strong developer who’s also good at writing, communication, or devrel—you’ll often get lower pay, hit career ceilings faster, and face subtle elitism from peers. The message is clear: spike on coding only, or you’re not serious.
But if you want to run your own business? Those “soft skills” are suddenly the highest leverage activities you can do.
Reflection: Hamel traces this back to his consulting days at Accenture, where consultants would laugh at their clients over dinner: “Can you believe they don’t know what they’re doing?” It’s a different matrix—the consulting matrix—where you’re convinced you could never work at “those companies.”
He only broke free after going to law school (which he hated), resetting his brain, and realizing: there are multiple matrices. And most engineers defend theirs without even seeing it.
The litmus test: If you’re good at writing, teaching, or explaining—that’s not weakness. The question isn’t whether it has “social status” in your team’s Slack. The question is: does it give you freedom?
5. When Evals Actually Matter (And When They’re Just Theater)
“You shouldn’t do evals unless you’re getting some value out of it. If you do an eval you get some immediate value out of it. Otherwise you shouldn’t do it.”
After all the discussion about what evals aren’t, Hamel is crystal clear about when they become genuinely valuable—and when they’re just performance.
Use evals when:
* You’ve identified a specific, recurring failure through error analysis
* The fix isn’t obvious (you’ll need to iterate)
* You have enough examples to validate your measurement approach
* The eval provides a signal you trust to guide rapid iteration
Skip evals when:
* You found a simple bug you can just fix (wrong tool in API call, syntax error)
* You’re trying to automate away understanding your product
* You’re chasing generic metrics without product context (hallucination scores, toxicity)
* You haven’t looked at your actual data yet
The spectrum of eval cost:
Code-based evals (cheap):
* Does the output contain a user ID?
* Are code blocks properly formatted?
* Uses regex or simple assertions
* Runs like a unit test
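To illustrate how cheap these checks are, here is a small sketch of two code-based evals written as plain assertions; the ID pattern, helper names, and sample outputs are illustrative, not from the episode:

```python
import re

def contains_user_id(output: str) -> bool:
    # Illustrative check: assume user IDs in this hypothetical product look like "usr_12345".
    return re.search(r"\busr_\d+\b", output) is not None

def code_blocks_are_closed(output: str) -> bool:
    # A Markdown code fence that gets opened must also get closed,
    # so the number of fence markers should be even.
    return output.count("`" * 3) % 2 == 0

# In a real suite these outputs would come from your pipeline on a fixed set of
# inputs; they are hard-coded here so the sketch stands alone.
fence = "`" * 3
samples = {
    "account lookup": "Your account usr_48219 is active.",
    "code request": f"Here is the snippet:\n{fence}python\nprint('hi')\n{fence}",
}

assert contains_user_id(samples["account lookup"])
assert code_blocks_are_closed(samples["code request"])
print("all code-based checks passed")
```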
LLM-as-judge evals (expensive):
* Is the tone appropriate?
* Is advice personalized enough?
* Requires making an LLM call
* Requires evaluating the evaluator (meta-eval)
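An LLM-as-judge check, by contrast, needs a model call per example. A hedged sketch of the shape it can take, with `call_llm` as a stand-in for whatever model client your stack uses (it’s a placeholder, not a real API):

```python
JUDGE_PROMPT = """You are reviewing a single reply from an AI health coach.
Question: is the tone appropriate for the user (supportive, not paternalistic)?
Answer with exactly one word: PASS or FAIL.

Reply to review:
{reply}
"""

def judge_tone(reply: str, call_llm) -> bool:
    # call_llm is a placeholder for your model client: it takes a prompt string
    # and returns the model's text response.
    verdict = call_llm(JUDGE_PROMPT.format(reply=reply)).strip().upper()
    return verdict == "PASS"
```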
The critical nuance: If you’re going to use an LLM to judge something, you have to know if it’s judging correctly. This means labeling examples, checking agreement rates, iterating on the judge prompt. Even your evals need evals.
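A minimal sketch of that meta-eval step, assuming you’ve hand-labeled a small set of traces and recorded the judge’s verdict alongside your own (the field names are illustrative):

```python
# Each record pairs a human label with the LLM judge's verdict for the same trace.
labeled = [
    {"human": "PASS", "judge": "PASS"},
    {"human": "FAIL", "judge": "PASS"},   # judge too lenient here
    {"human": "FAIL", "judge": "FAIL"},
    {"human": "PASS", "judge": "PASS"},
]

agreement = sum(r["human"] == r["judge"] for r in labeled) / len(labeled)
print(f"judge/human agreement: {agreement:.0%}")  # 75% in this toy sample

# If agreement is too low to trust, iterate on the judge prompt (or your rubric)
# and re-check against the same human labels before using the judge at scale.
```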
Reflection: The discourse around evals has become tribal: “Evals are everything!” vs “Evals are dead!” Both are wrong. Hamel’s position is pragmatic: evals are a tool. Use them when they accelerate product improvement. Skip them when they don’t.
The real enemy isn’t evals or no-evals—it’s “eval theater.” Teams running automated vendor tools to generate scores no one understands, trusts, or acts on. Dashboards that look impressive in investor decks but don’t change what gets shipped.
What matters: Are you getting better, faster? That’s the only eval that counts.
🎧 Resources and Further Listening
Connect with Hamel:
* Follow Hamel on LinkedIn
* Follow Hamel on X/Twitter
* Hamel on YouTube
* Hamel’s Substack
Tools & Platforms Mentioned:
* Get 35% off Hamel’s LLM Evals Course – Next cohort starts October 6th.
* CodeSearchNet – Hamel’s pioneering work on code understanding at GitHub, creating a benchmark used by early coding models like the original Codex
* Delphi AI – The platform Hamel uses to create AI assistants (course students get exclusive access to one trained on all evals materials)
Referenced Ideas:
* Peter Thiel’s competition philosophy – “Competition is for losers,” discussed in the context of avoiding crowded markets like evals tooling