
In this episode of Inference Time Tactics, Rob, Cooper, and Byron explore Salesforce’s CRMArena-Pro benchmark and what it reveals about the limits of enterprise AI agents. They share why benchmark scores often fail in production, how inference-time tactics like best-of-N can improve reliability, and what NeuroMetric is building to make eval easier—from an ITC Test Engine to a drag-and-drop interface for rapid visualization and experimentation.
We talked about:
Resources Mentioned:
CRMArena-Pro from Salesforce:
https://www.salesforce.com/blog/crmarena-pro/
Connect with Neurometric:
Substack: https://neurometric.substack.com/
X: https://x.com/neurometric/
Bluesky: https://bsky.app/profile/neurometric.bsky.social
Hosts:
Rob May
https://x.com/robmay
https://www.linkedin.com/in/robmay
Calvin Cooper
https://x.com/cooper_nyc_
https://www.linkedin.com/in/coopernyc
Guest:
Byron Galbraith
https://x.com/bgalbraith
https://www.linkedin.com/in/byrongalbraith
By NeuroMetric AI