The Data Engineering Show

AI Won't Replace Engineers, But This Framework Will Change How They Build with Rohit Girme



Scaling AI from proof-of-concept to production requires more than just deploying models; it demands robust evaluation frameworks, human oversight, and a fundamental shift in how engineering teams approach development.

In this episode of The Data Engineering Show, host Benjamin Wagner sits down with Rohit Girme, Staff Software Engineer at Airbnb, to explore how Airbnb built a Gen AI evaluation platform to assess LLM outputs across product surfaces, from customer support bots to search and booking experiences. Rohit shares insights into Airbnb's infrastructure choices, evaluation workflows, and lessons learned about leveraging AI tools while maintaining human orchestration.


What You'll Learn:


- How to architect a multi-layer Gen AI evaluation platform using Python, vLLM, Kubernetes, and DAG-based workflows to systematically test LLM outputs in production

- Why splitting monolithic "virtual judges" into specialized LLM-powered metrics (content relevance, hallucination detection, policy adherence) dramatically improves evaluation accuracy and debugging

- The critical distinction between real-time evaluation (lightweight, sub-second latency) and offline evaluation (comprehensive, human-in-the-loop) and how to route outputs accordingly

- How to shift from traditional software engineering (deterministic, rule-based testing) to probabilistic AI evaluation where you validate outputs against golden datasets and human judgment benchmarks

- The framework for breaking down problems into smaller chunks and using AI tools as collaborators rather than end-to-end problem solvers—critical when working with codebases at massive scale

- Why documentation becomes infrastructure in an AI-driven workflow: LLMs need comprehensive, well-formatted docs to scale tribal knowledge across entire organizations

- The hard truth about AI and scaling: zero-to-one innovation is now commoditized, but one-to-n execution (the scaling part) still demands human judgment, orchestration, and product sense

- How to measure AI tool adoption beyond token usage: instrument your development workflow to capture whether LLM suggestions actually made it into shipped code and added real value
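
As a rough illustration of two of the ideas above (splitting one monolithic judge into specialized metrics, and validating each metric against a golden dataset of human labels), here is a minimal Python sketch. It is not Airbnb's implementation. The `Example` class, the `agreement_with_golden` helper, and the keyword-based scoring functions are hypothetical stand-ins; in a real platform each metric would presumably call an LLM judge instead of a heuristic.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Example:
    """One golden-dataset row: a model answer plus human pass/fail labels per metric."""
    question: str
    context: str
    answer: str
    human_labels: Dict[str, bool]


# A metric checks one narrow concern and returns pass/fail.
Metric = Callable[[Example], bool]


def content_relevance(ex: Example) -> bool:
    # Stand-in heuristic (assumption): the answer shares a word with the question.
    question_words = set(ex.question.lower().split())
    answer_words = set(ex.answer.lower().split())
    return bool(question_words & answer_words)


def hallucination_free(ex: Example) -> bool:
    # Stand-in heuristic (assumption): every number in the answer appears in the context.
    context_tokens = ex.context.split()
    numbers = [tok for tok in ex.answer.split() if tok.isdigit()]
    return all(n in context_tokens for n in numbers)


# Specialized metrics replace a single all-purpose judge; each can be debugged alone.
METRICS: Dict[str, Metric] = {
    "content_relevance": content_relevance,
    "hallucination_free": hallucination_free,
}


def agreement_with_golden(
    examples: List[Example], metrics: Dict[str, Metric]
) -> Dict[str, float]:
    """Per metric, the fraction of examples where it matches the human label."""
    result: Dict[str, float] = {}
    for name, metric in metrics.items():
        labelled = [ex for ex in examples if name in ex.human_labels]
        if not labelled:
            continue
        hits = sum(metric(ex) == ex.human_labels[name] for ex in labelled)
        result[name] = hits / len(labelled)
    return result


golden = [
    Example(
        question="What is the cancellation fee?",
        context="The cancellation fee is 20 dollars",
        answer="The cancellation fee is 20 dollars",
        human_labels={"content_relevance": True, "hallucination_free": True},
    ),
    Example(
        question="What is the cancellation fee?",
        context="The cancellation fee is 20 dollars",
        answer="The fee is 50 dollars",
        human_labels={"content_relevance": True, "hallucination_free": False},
    ),
]

scores = agreement_with_golden(golden, METRICS)
print(scores)
```

Reporting agreement per metric, rather than one blended score, makes it easier to see which specialized judge has drifted from human judgment.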



About the Guest(s)


Rohit Girme is a Staff Software Engineer at Airbnb, where he has spent the last seven and a half years building infrastructure and platforms at scale. With deep expertise in search and machine learning infrastructure, Rohit leads efforts in GenAI evaluation and has pioneered Airbnb's approach to ensuring AI-powered features work reliably in production. In this episode, Rohit shares practical insights on building evaluation platforms for large language models, orchestrating AI in product workflows, and leveraging AI tools effectively in software development. His work on integrating LLMs into customer-facing products while maintaining quality and performance provides actionable strategies for engineering teams navigating the rapid adoption of AI, making this conversation essential for data engineers and platform builders looking to scale AI responsibly.



Quotes

"Zero to one is easy now, but the one to n, which is a scaling part, I think we still haven't figured that out. You still need humans for that." - Rohit

"With AI, it's a black box to us as well. We don't know how it's working underneath, so we have to figure out another way to evaluate the surface." - Rohit Girme

"Humans should be the orchestrators of these tools and not just hand off everything to these tools." - Rohit Girme

"If we hand off everything to the LLM, it will make a lot of assumptions because context is limited, and it doesn't know the code enough." - Rohit Girme

"Documentation has become even more relevant because now LLMs need to know everything so everyone can scale up." - Rohit Girme

"Measuring productivity in LLMs is not just about how many tokens people are using—you need to figure out if they're actually building something on top." - Rohit Girme

"Internet democratized information, and I think with LLMs, it's capability that would be democratized. If you have a good idea, you can build it very quickly." - Rohit Girme

"There's always going to be blind spots for every person, but with AI, it'll become even faster because you have this very short cycle of talking to the AI instead of talking to five humans." - Rohit Girme

"Shipping products or shipping features would become even faster—where earlier it took weeks or months, now it will be days." - Rohit Girme



"I have supercharged my workflow day to day either at work or at home with access to information that's so easy to get." - Rohit Girme

If you enjoyed this episode, make sure to subscribe, rate, and review it on Apple Podcasts, Spotify, and YouTube Podcasts. Instructions on how to do this are here: https://www.fame.so/follow-rate-review 

Resources

LinkedIn Profiles:

  • Rohit Girme's LinkedIn: https://www.linkedin.com/in/rohitgirme/
  • Benjamin Wagner's LinkedIn: https://www.linkedin.com/in/wagjamin

Company Websites:

  • Airbnb: airbnb.com
  • Firebolt: firebolt.io

Tools & Platforms:

  • vLLM – Open-source engine for LLM inference and serving
  • Kubernetes – Container orchestration platform used for serving infrastructure
  • Apache Airflow – DAG-based workflow orchestration tool (originated at Airbnb)
  • GitHub Copilot – AI-powered code completion tool for software development
  • Claude – LLM tool referenced for code generation and development assistance

Cloud Services:

  • Azure – Hosted LLM services used at Airbnb
  • AWS – Hosted LLM services used at Airbnb

The Data Engineering Show is brought to you by firebolt.io and handcrafted by our friends over at: fame.so

Previous guests include: Joseph Machado of LinkedIn, Matthew Weingarten of Disney, Joe Reis and Matt Housley, authors of The Fundamentals of Data Engineering, Zach Wilson of Eczachly Inc, Megan Lieu of Deepnote, Erik Heintare of Bolt, Lior Solomon of Vimeo, Krishna Naidu of Canva, Mike Cohen of Substack, Jens Larsson of Ark, Gunnar Tangring of Klarna, Yoav Shmaria of Similarweb and Xiaoxu Gao of Adyen.

Check out our three most downloaded episodes:
  • Zach Wilson on What Makes a Great Data Engineer
  • Joe Reis and Matt Housley on The Fundamentals of Data Engineering
  • Bill Inmon, The Godfather of Data Warehousing

The Data Engineering Show, by The Firebolt Data Bros


