The AWS Developers Podcast

Why Your Agent Evaluations Will Fail You (and How to Fix Them Before Production)



Anthropic deprecated Sonnet 3.5. Some of Xelix's pipelines migrated smoothly. Others broke, and customers noticed within hours. What separated the two? Evaluation.

Paul Solomon and James Price Farr have spent 5+ years building AI systems that process millions of invoices for enterprise customers. In this episode, they share the evaluation-first framework that now saves them every time a model changes, an orchestration layer fails, or an agent picks the wrong tool.

Key takeaways:

• Evaluation-first, not evaluation-after: Retrofitting evaluation on an agent already in production is painful. Build your eval pipeline before you build the agent.
• Monitor tool calls, not just outputs: If the agent isn't selecting the right tools, nothing downstream will be correct. Tool-call monitoring is your leading indicator.
• 3 tiers of automation: Not everything needs an agent. Rules-based → single LLM call → agentic system. Pick the simplest tier that solves the problem.
• Extended thinking tames token explosion: After migrating to newer, more verbose models, enabling extended thinking (with a budget) moved reasoning out of expensive output tokens and brought costs back under control.
• Human-in-the-loop by default: Start with human review on every output, then earn trust toward touchless automation as customers gain confidence.
• Pragmatism wins: Use whatever technology works best for the problem. Not every feature needs an LLM.

Recorded live at AWS Summit London.
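To make the "monitor tool calls" takeaway concrete, here is a minimal sketch of a tool-call evaluation in Python. It is not Xelix's pipeline: the `EvalCase` type, the tool names, and the `fake_agent` stand-in are all hypothetical, and a real setup would read the tool calls from the agent framework's traces rather than from a stub.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    expected_tools: list[str]  # hypothetical tool names, in the expected call order


def tool_call_accuracy(cases: list[EvalCase],
                       run_agent: Callable[[str], list[str]]) -> float:
    """Fraction of cases whose recorded tool calls match the expected sequence exactly."""
    if not cases:
        return 0.0
    hits = sum(1 for case in cases if run_agent(case.prompt) == case.expected_tools)
    return hits / len(cases)


def fake_agent(prompt: str) -> list[str]:
    # Stand-in for a real agent run (assumption): returns the tools it "called".
    if "duplicate" in prompt:
        return ["lookup_invoice", "compare_invoices"]
    return ["lookup_invoice"]


cases = [
    EvalCase("Is invoice 123 a duplicate?", ["lookup_invoice", "compare_invoices"]),
    EvalCase("Fetch invoice 456", ["lookup_invoice"]),
    EvalCase("Email the supplier about invoice 789", ["lookup_invoice", "send_email"]),
]

score = tool_call_accuracy(cases, fake_agent)
print(f"tool-call accuracy: {score:.2f}")
```

Running a suite like this before and after a model migration gives an early signal: a drop in tool-call accuracy shows up even when final outputs still look plausible.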

With Paul Solomon, Head of AI Engineering at Xelix, and James Price Farr, AI Engineering Team Lead at Xelix

• Xelix: AI-Powered Accounts Payable Platform
• Strands Agents SDK: Open Source
• Amazon Bedrock: Managed LLM Inference
• Amazon Bedrock AgentCore
• Strands Agents: Steering Files and Hooks for Agent Accuracy (Claire Liguori)
• Amazon SageMaker
• Fast.ai: Practical Deep Learning Courses (Book Recommendation)
• The Fifth Risk, by Michael Lewis (Book Recommendation)
• Neurosymbolic AI and Automated Reasoning on AWS
• Kiro: AI-Powered Development Environment

The AWS Developers Podcast, by Amazon Web Services
