Inference Time Tactics

Beyond Vibe Testing: Smarter Eval for Agentic AI



In this episode of Inference Time Tactics, Rob, Cooper, and Byron explore Salesforce’s CRMArena-Pro benchmark and what it reveals about the limits of enterprise AI agents. They share why benchmark scores often fail in production, how inference-time tactics like best-of-N can improve reliability, and what NeuroMetric is building to make eval easier—from an ITC Test Engine to a drag-and-drop interface for rapid visualization and experimentation.

 

We talked about:

  • Why Salesforce’s CRMArena-Pro benchmark highlights the gap between lab benchmarks and real-world agent reliability.
  • How leading models perform inconsistently across single-turn and multi-turn enterprise tasks.
  • Why benchmark scores are weak predictors of operational success in production.
  • The role of inference-time tactics in reducing variance and improving stability.
  • NeuroMetric’s new platform: ITC Test Engine and drag-and-drop interface for experimentation.
  • Challenges in building agentic systems, from database integration to managing multi-prompt complexity.
  • Why large language models’ stochastic nature conflicts with business demands for reliability.
  • Latency, cost, and rate limits as major bottlenecks in scaling agentic workflows.
  • The limits of “vibe testing” and why rigorous evaluation frameworks are essential.
  • How Google’s Stacks tool speeds up evaluation with LLM-as-judge, and why it still falls short for enterprise needs.
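For readers unfamiliar with best-of-N, here is a minimal sketch of the idea: sample the agent several times and keep one answer according to a selection rule. It is not the method discussed in the episode or NeuroMetric's implementation. The names `best_of_n`, `majority_vote`, and `make_noisy_agent` are hypothetical, the "agent" is a simulated stand-in for a real model call, and majority voting is only one possible selection rule (a production system might use a verifier or an LLM-as-judge score instead).

```python
import random
from collections import Counter
from typing import Callable, List


def majority_vote(candidates: List[str]) -> str:
    """Pick the most frequent candidate; ties go to the one seen first."""
    return Counter(candidates).most_common(1)[0][0]


def best_of_n(
    generate: Callable[[], str],
    select: Callable[[List[str]], str],
    n: int,
) -> str:
    """Sample n candidate answers and return the one chosen by `select`."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate() for _ in range(n)]
    return select(candidates)


def make_noisy_agent(
    rng: random.Random, correct: str = "42", p_correct: float = 0.7
) -> Callable[[], str]:
    """Hypothetical stand-in for a stochastic model call (assumption)."""
    def agent() -> str:
        if rng.random() < p_correct:
            return correct
        return rng.choice(["41", "43"])  # simulated wrong answers
    return agent


def accuracy(answer_fn: Callable[[], str], correct: str, trials: int) -> float:
    """Fraction of trials in which answer_fn returns the correct answer."""
    return sum(answer_fn() == correct for _ in range(trials)) / trials


if __name__ == "__main__":
    agent = make_noisy_agent(random.Random(0))
    single = accuracy(agent, "42", 2000)
    voted = accuracy(lambda: best_of_n(agent, majority_vote, 5), "42", 2000)
    print(f"single sample accuracy: {single:.2f}")
    print(f"best-of-5 accuracy:     {voted:.2f}")
```

The trade-off is the one raised in the list above: each best-of-N answer costs N model calls, so the reliability gain comes with more latency, more spend, and more pressure on rate limits.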


Resources Mentioned:

    CRMArena-Pro from Salesforce:

    https://www.salesforce.com/blog/crmarena-pro/  

     

    Connect with NeuroMetric:

    Website: https://www.neurometric.ai/ 

    Substack: https://neurometric.substack.com/ 

    X: https://x.com/neurometric/ 

    Bluesky: https://bsky.app/profile/neurometric.bsky.social

     

    Hosts:

    Rob May

    https://x.com/robmay 

    https://www.linkedin.com/in/robmay

     

    Calvin Cooper

    https://x.com/cooper_nyc_ 

    https://www.linkedin.com/in/coopernyc

     

    Guest:

    Byron Galbraith

    https://x.com/bgalbraith 

    https://www.linkedin.com/in/byrongalbraith


    Inference Time Tactics, by NeuroMetric AI