


Izzy Miller is an AI engineer at Hex, an AI analytics platform that was one of the first companies to ship data agents to real paying users. Today, Hex runs a multi-agent system with nearly 100K tokens of tools, and Izzy is building a 90-day simulation to evaluate whether those agents actually get smarter over time. In this conversation, he walks through the harness decisions that shaped their architecture, the failure modes Hex is seeing at scale, and what it takes to build an eval that no current model can pass.
Send feedback or questions to [email protected]
Timestamps:
01:35 Where Hex's notebook agent started
03:46 The moment Hex knew it was time for agents
07:36 Why data agents are harder to verify than coding agents
09:30 How Hex is unifying separate agents
13:28 Under the hood of the notebook agent
15:41 The harness features that are now holding the agent back
17:41 Why Hex built their own orchestrator
18:59 Managing nearly 100K tokens of tools
20:49 Ephemeral queries and agent behavior trade-offs
24:46 The UX problem with showing agents' thinking
27:28 Why verification is harder than transparency for data agents
31:00 Memory, context conflicts, and collapse modes
34:38 How Hex built their internal eval system
39:29 Why most eval sets are bad
44:30 The 900% quota eval that every model fails
46:55 Model upgrades and the "in distribution" debate
51:34 How Izzy went from marketer to AI engineer
59:59 The 90-day simulation for long-horizon evals
By LangChain