Max Agency

How Hex builds AI agents that reason like human data analysts | Izzy Miller, AI Engineer



Izzy Miller is an AI engineer at Hex, an AI analytics platform that was one of the first companies to ship data agents to real paying users. Today, Hex runs a multi-agent system with nearly 100K tokens of tools, and Izzy is building a 90-day simulation to evaluate whether those agents actually get smarter over time. In this conversation, he walks through the harness decisions that shaped their architecture, the failure modes Hex is seeing at scale, and what it takes to build an eval that no current model can pass.


We also discuss:

  • Why data agents are harder to verify than coding agents
  • Under the hood of Hex’s agents
  • How Hex is unifying separate agents
  • Why most eval sets are bad
  • The 90-day simulation for long-horizon evals
  • How Izzy went from marketing to AI engineer


References:

  • Andon Labs
  • Anthropic
  • Barry McCardel
  • ChatGPT
  • Claude Code
  • Claude Sonnet 4.6
  • dbt
  • GPT-3.5 Turbo
  • GPT-5.3 Codex Spark
  • GPT-5.4
  • Hex
  • LangChain
  • LangSmith
  • Looker
  • OpenAI
  • Opus 4.6
  • Satya Nadella
  • Snowflake
  • Vending Machine


Where to find Izzy:

  • LinkedIn
  • Twitter/X


Where to find Harrison:

  • LinkedIn
  • Twitter/X


Where to find LangChain:

  • Website
  • Docs


Send feedback or questions to [email protected]


Timestamps:

01:35 Where Hex's notebook agent started

03:46 The moment Hex knew it was time for agents

07:36 Why data agents are harder to verify than coding agents

09:30 How Hex is unifying separate agents

13:28 Under the hood of the notebook agent

15:41 The harness features that are now holding the agent back

17:41 Why Hex built their own orchestrator

18:59 Managing nearly 100K tokens of tools

20:49 Ephemeral queries and agent behavior trade-offs

24:46 The UX problem with showing agents' thinking

27:28 Why verification is harder than transparency for data agents

31:00 Memory, context conflicts, and collapse modes

34:38 How Hex built their internal eval system

39:29 Why most eval sets are bad

44:30 The 900% quota eval that every model fails

46:55 Model upgrades and the "in distribution" debate

51:34 How Izzy went from marketer to AI engineer

59:59 The 90-day simulation for long-horizon evals


Max Agency, by LangChain