
This episode is sponsored by AGNTCY. Unlock agents at scale with an open Internet of Agents.
Visit https://agntcy.org/ and add your support.
How do the world's most powerful AI models get trained and trusted at scale, and what does that really take from data to deployment?
Appen CEO Ryan Kolln joins Eye on AI to unpack how rigorous human evaluation, culturally aware data, and model-based judges come together to improve real-world performance.
Host Craig Smith speaks with Ryan about building evaluation systems that go beyond static benchmarks to measure usefulness, safety, and reliability in production. They explore how human raters and AI evaluators work in tandem, why localization matters across regions and domains, and how quality controls keep feedback signals trustworthy for training and post-training.
Ryan explains how evaluation feeds reinforcement strategies, where rubric-driven human judgments inform reward models, and how enterprises can stand up secure workflows for sensitive use cases. He also discusses emerging needs around sovereign models, domain-specific testing, and the shift from general chat to agentic workflows that operate inside real business systems.
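For readers who want a concrete picture of how rubric-driven judgments can feed a reward model, here is a minimal sketch. It is not taken from the episode and is not Appen's pipeline; the rubric criteria, weights, and data structures are illustrative assumptions. It shows one common pattern: collapsing per-criterion human ratings into a scalar and emitting preference pairs only when raters clearly favor one response.

```python
# Illustrative sketch only (not from the episode): turning rubric-scored human
# judgments into preference pairs of the kind a reward model is trained on.
# Rubric names, weights, and data structures are hypothetical assumptions.
from dataclasses import dataclass
from itertools import combinations

@dataclass
class RatedResponse:
    response: str
    scores: dict  # rubric criterion -> 1-5 rating from a human rater

RUBRIC_WEIGHTS = {"helpfulness": 0.4, "factuality": 0.4, "safety": 0.2}

def weighted_score(r: RatedResponse) -> float:
    """Collapse per-criterion ratings into a single scalar judgment."""
    return sum(RUBRIC_WEIGHTS[c] * r.scores[c] for c in RUBRIC_WEIGHTS)

def to_preference_pairs(prompt: str, rated: list, margin: float = 0.5):
    """Emit (prompt, chosen, rejected) triples when raters clearly prefer one response."""
    pairs = []
    for a, b in combinations(rated, 2):
        gap = weighted_score(a) - weighted_score(b)
        if abs(gap) >= margin:  # skip near-ties; weak signal just adds label noise
            chosen, rejected = (a, b) if gap > 0 else (b, a)
            pairs.append({"prompt": prompt,
                          "chosen": chosen.response,
                          "rejected": rejected.response})
    return pairs

if __name__ == "__main__":
    rated = [
        RatedResponse("Detailed, sourced answer.", {"helpfulness": 5, "factuality": 5, "safety": 5}),
        RatedResponse("Terse, partly wrong answer.", {"helpfulness": 3, "factuality": 2, "safety": 5}),
    ]
    print(to_preference_pairs("Explain RLHF in one paragraph.", rated))
```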
Learn how leading teams design human-in-the-loop evaluation, when to route judgments from models back to expert reviewers, how to capture cultural nuance without losing universal guardrails, and how to build an evaluation stack that scales from early prototypes to production AI.
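As a rough illustration of the routing question above, the snippet below escalates a model judge's verdict to an expert human reviewer when confidence is low, a safety guardrail is flagged, or the domain is high-stakes. Again, this is a sketch under stated assumptions, not a description of any workflow discussed in the episode; the thresholds, fields, and domain list are invented for the example.

```python
# Illustrative sketch only (not described in the episode): a minimal routing rule
# for deciding when an automated judge's verdict is accepted and when it is
# escalated to an expert human reviewer. Thresholds and fields are assumptions.
from dataclasses import dataclass

@dataclass
class JudgeVerdict:
    score: float        # model judge's quality score, 0-1
    confidence: float   # judge's self-reported confidence, 0-1
    safety_flag: bool   # judge suspects the output may violate a guardrail
    domain: str         # e.g. "general", "medical", "legal"

HIGH_STAKES_DOMAINS = {"medical", "legal", "financial"}

def route(verdict: JudgeVerdict, confidence_floor: float = 0.8) -> str:
    """Return 'auto_accept' or 'human_review' for a single judged output."""
    if verdict.safety_flag:
        return "human_review"             # guardrail hits always get a person
    if verdict.domain in HIGH_STAKES_DOMAINS:
        return "human_review"             # domain expertise beats a generic judge
    if verdict.confidence < confidence_floor:
        return "human_review"             # low-confidence judgments are weak labels
    return "auto_accept"

if __name__ == "__main__":
    print(route(JudgeVerdict(score=0.9, confidence=0.95, safety_flag=False, domain="general")))
    print(route(JudgeVerdict(score=0.7, confidence=0.60, safety_flag=False, domain="medical")))
```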
Stay Updated: Craig Smith on X: https://x.com/craigss | Eye on A.I. on X: https://x.com/EyeOn_AI
By Craig S. Smith
