February 02, 2026

Building LLM Agents: Evaluation, Safety, and Tool Use with Georgios Chouliaras

44 minutes

From BERT to Agents: Building Production AI at Booking.com

After seven years building ML systems that serve millions of travelers, Georgios Chouliaras has watched the field transform from hand-coded chatbot rules to autonomous agents—and he's learned which shiny new approaches actually work in production.

Georgios Chouliaras, Senior Machine Learning Scientist at Booking.com, joins me to share hard-won insights from deploying AI at scale. His journey spans customer service chatbots that broke during COVID (because the training data didn't include "global pandemic"), company-wide ML best practices, and now the cutting edge of agent development.

In this episode, we explore:

Why LLMs represent the biggest abstraction leap since high-level programming languages, and what control you sacrifice for that flexibility
The practical framework for deciding when LLMs beat classical ML (hint: it's not always about having text data)
How to build LLM judges that actually work: starting with binary labels, achieving annotator agreement before anything else, and why boundary cases matter most for few-shot examples
What's genuinely unsolved in agents right now: memory as lifelong learning, and planning approaches that don't collapse under complexity
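
For readers who want something concrete on the LLM-judge point above: one common way to check annotator agreement on binary labels is Cohen's kappa, which corrects raw agreement for the agreement two annotators would reach by chance. The episode notes do not say which metric Georgios uses, so treat this as a minimal sketch under that assumption. The function name and the pass/fail labels below are made up for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)

    # Fraction of items on which the two annotators gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected by chance, from each annotator's label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    classes = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in classes) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical binary labels (1 = pass, 0 = fail) from two human annotators.
annotator_1 = [1, 1, 0, 1, 0, 0, 1, 0]
annotator_2 = [1, 0, 0, 1, 0, 1, 1, 0]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

A low score here usually suggests the labelling guidelines need tightening before the judge prompt is tuned, and the items the annotators disagree on are natural candidates for the boundary-case few-shot examples mentioned above.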

Georgios challenges some popular assumptions. The ReAct pattern everyone implements? He hasn't seen it consistently outperform simpler approaches. Massive parameter counts? Architecture and training data now matter more. His underhyped pick: straightforward function calling, which often beats elaborate agent architectures.

The core takeaway: use the simplest tool that solves your problem. Production users don't care if you're running a sophisticated multi-agent system; they care if it works.

Connect with Georgios:

LinkedIn: https://www.linkedin.com/in/chouligi/

Connect with me:

LinkedIn: https://www.linkedin.com/in/christianbarra/

Check out our awesome sponsor, dearmachines.com, QA AI Agents for Continuous Testing.


By Christian Barra
