The reliability gap between AI models and production-ready applications is where countless enterprise initiatives die in POC purgatory. In this episode of Chief AI Officer, Doug Safreno, Co-founder & CEO of Gentrace, walks through the testing infrastructure that has helped his customers escape the Whac-A-Mole cycle plaguing AI development. Having hit the problem firsthand while building an email assistant on GPT-3 in late 2022, Doug explains why traditional evaluation methods fail with generative AI, where outputs can be wrong in countless ways beyond simple classification errors.
With Gentrace positioned as a "collaborative LLM testing environment" rather than just a visualization layer, Doug shares how the company has helped customers move from isolated engineering testing to cross-functional evaluation, increasing test velocity 40x and enabling successful production launches. His insights from running monthly dinners with bleeding-edge AI engineers reveal how the industry conversation has evolved from basic product questions to sophisticated technical challenges around retrieval and agentic workflows.
Highlights from the conversation:

- Why asking LLMs to grade their own outputs creates circular testing failures, and how giving evaluator models access to reference data or expected outcomes the generating model never saw leads to meaningful quality assessment (see the first sketch after this list).
- How Gentrace's platform enables subject matter experts, product managers, and educators to contribute to evaluation without coding, increasing test velocity by 40x.
- Why aiming for 100% accuracy is often a red flag, and how to determine the right threshold based on recoverability of errors, stakes of the application, and business model considerations.
- Testing strategies for multi-step processes where the final output might be an edit to a document rather than text, requiring inspection of entire traces and intermediate decision points.
- How engineering discussions have shifted from basic form-factor questions (chatbot vs. autocomplete) to specific technical challenges in implementing retrieval with LLMs and agentic workflows.
- How converting user feedback on problematic outputs into automated test criteria creates continuous improvement loops without requiring engineering resources (see the second sketch after this list).
- Using monthly dinners with 10-20 bleeding-edge AI engineers and broader events with 100+ attendees to create learning communities that generate leads while solving real problems.
- Why 2024 was about getting basic evaluation in place, while 2025 will expose the limitations of simplistic frameworks that don't use "unfair advantages" or collaborative approaches.
- How to frame AI reliability differently from traditional software while still providing governance, transparency, and trust across organizations.
- Signs a company is ready for advanced evaluation infrastructure: when playing Whac-A-Mole with fixes, when product managers easily break AI systems despite engineering evals, and when lack of organizational trust is blocking deployment.
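To make the evaluator point concrete, here is a minimal sketch of a reference-backed LLM judge: the grading model is shown an expected answer that the generating model never saw, so it isn't simply grading its own reasoning in a circle. The model name, prompt wording, and test-case fields are illustrative assumptions, not Gentrace's API.

```python
# Minimal sketch of a reference-backed LLM evaluator. The judge model sees an
# expected answer the generating model never had access to, giving it an
# "unfair advantage" over self-grading. Names and prompts are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def grade_with_reference(question: str, model_output: str, expected_output: str) -> bool:
    """Return True if the judge deems the output consistent with the reference."""
    judge_prompt = (
        "You are grading an AI assistant's answer against a reference answer.\n"
        f"Question: {question}\n"
        f"Assistant answer: {model_output}\n"
        f"Reference answer (the assistant never saw this): {expected_output}\n"
        "Reply with exactly PASS if the assistant's answer is factually consistent "
        "with the reference, otherwise reply with exactly FAIL."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable judge model would work here
        messages=[{"role": "user", "content": judge_prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper().startswith("PASS")


# Example test case: the reference answer is the evaluator's "unfair advantage".
case = {
    "question": "When does the contract auto-renew?",
    "model_output": "It renews automatically every 12 months unless cancelled.",
    "expected_output": "The contract auto-renews annually unless cancelled 30 days prior.",
}
print(grade_with_reference(**case))
```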
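And here is a similarly hedged sketch of the feedback-to-test loop: a flagged production output plus a reviewer's plain-language criteria is appended to an evaluation dataset, so the next run checks for the regression without new engineering work per case. The function names and JSONL layout are assumptions for illustration, not a specific product's format.

```python
# Hypothetical sketch of turning user feedback into a regression test case.
# A thumbs-down event plus a reviewer's note becomes a reusable entry in the
# evaluation dataset; field names and the JSONL format are assumptions.
import json
from datetime import datetime, timezone


def feedback_to_test_case(trace_input: str, bad_output: str, reviewer_note: str) -> dict:
    """Convert a flagged production output into a test case for the eval suite."""
    return {
        "input": trace_input,            # what the user originally asked
        "rejected_output": bad_output,   # the output the user flagged
        "criteria": reviewer_note,       # plain-language pass/fail criteria
        "added_at": datetime.now(timezone.utc).isoformat(),
    }


def append_to_dataset(case: dict, path: str = "eval_dataset.jsonl") -> None:
    """Append the case to a JSONL dataset that the recurring eval run reads."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(case) + "\n")


# Example: a support lead flags a hallucinated refund policy.
case = feedback_to_test_case(
    trace_input="What is your refund window?",
    bad_output="We offer refunds for up to a year.",
    reviewer_note="Answer must state the 30-day refund window and nothing longer.",
)
append_to_dataset(case)
```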