The Data Engineering Show

Llama 2 & 3 Safety: Soumya Batra on Agentic AI Training



In this episode of The Data Engineering Show, host Benjamin Wagner sits down with Soumya Batra, founder and CEO of WisePort AI and former tech lead at Meta where she led safety efforts for Llama 2 and Llama 3, to explore the evolution of NLP, the complete lifecycle of foundation model training, and why the next AI frontier lies in natively agentic systems rather than simply scaling larger transformers.


What You'll Learn:

  • Why historical NLP work becomes obsolete with each paradigm shift: Understand how Bayesian networks, RNNs, and LSTMs each dominated until they were replaced, and why current transformer-scaling dogma will likely face the same fate
  • How to structure the foundation model training lifecycle for safety: Learn the three critical phases, pretraining (data mix optimization), supervised fine-tuning (instruction alignment), and reinforcement learning (human preference integration), and see where safety interventions deliver maximum leverage
  • The counterintuitive data strategy for pretraining safety: Discover why removing all toxic content actually weakens model robustness, and how maintaining a precise balance preserves the model's ability to classify and refuse harmful requests
  • How dual reward models maximize both helpfulness and safety: See why combining helpfulness and safety objectives (as done in Llama 3) ensures every training sample reinforces both capabilities simultaneously rather than creating trade-offs
  • What "natively agentic" means and why it matters more than LLM-powered agents: Learn how foundational agentic models dynamically explore action spaces at inference time instead of relying on fixed developer-defined scaffolding, unlocking domain-agnostic workflows
  • How to build a foundational AI startup without massive training datasets: Understand why synthetic data generation, deterministic task validation, and deep domain expertise can substitute for Internet-scale language corpora in the agentic space

If you enjoyed this episode, make sure to subscribe, rate, and review it on Apple Podcasts, Spotify, and YouTube Podcasts.

About the Guest

Soumya Batra is the Founder and CEO of WisePort AI, a foundational AI company specializing in agentic AI systems. With over twelve years of expertise in NLP and machine learning, she previously served as a Tech Lead and Applied Research Scientist at Meta, where she led safety and controllability efforts for both Llama 2 and Llama 3. Her career spans foundational work at Carnegie Mellon University, Microsoft, and Meta, establishing her as a pioneering voice in conversational AI and foundation model development. In this episode, Soumya demystifies the journey from traditional NLP to large language models, revealing how safety and controllability are embedded across the entire model lifecycle—from pretraining through reinforcement learning. Her insights on the future of agentic AI and the limitations of current scaling-only approaches provide essential perspective for data engineers and ML practitioners navigating the rapidly evolving AI landscape.


Quotes
"I did not know then that this would become my career for the next decade." - Soumya

"Whatever work that I've done in the past becomes irrelevant all of a sudden." - Soumya

"There is always a notion of, yes, this is the big thing, and then no, it's not anymore." - Soumya

"I really think that we are going to be proven wrong once again about scaling transformers being the only way to achieve general intelligence." - Soumya

"Safety was an issue even back then, even though we were training in such controlled settings." - Soumya

"If you don't put some toxic content there, then it will lose the ability to classify it and it'll be much easier to break the safety later on." - Soumya

"In the post training phase, we are giving it that ability to be able to answer users' questions." - Soumya

"The next unlock will now come from foundational agent models that are natively agentic, which will unlock use cases that look unimaginable to us right now." - Soumya

"Natively agentic means the foundational model itself needs to dynamically explore the action space, rather than scaffolding around existing LLMs." - Soumya

"The real unlock comes from creating your own use cases, creating your own synthetic data, and going deep into a few workflows." - Soumya


Resources
Connect on LinkedIn:

  • Soumya Batra - https://in.linkedin.com/in/soumyabatra
  • Benjamin Wagner - https://www.linkedin.com/in/wagjamin

Websites:
  • WisePort AI - https://www.wiseport.ai
  • Firebolt - https://www.firebolt.io

Articles & Research Papers:
  • LLaMA: Open and Efficient Foundation Language Models – Meta AI Research
  • LIMA: Less Is More for Alignment – Meta AI, Carnegie Mellon University, USC, and Tel Aviv University

Educational Institutions:
  • Carnegie Mellon University - Language Technologies Institute (LTI)

The Data Engineering Show is brought to you by firebolt.io and handcrafted by our friends over at: fame.so

Previous guests include: Joseph Machado of LinkedIn, Matthew Weingarten of Disney, Joe Reis and Matt Housley, authors of The Fundamentals of Data Engineering, Zach Wilson of Eczachly Inc, Megan Lieu of Deepnote, Erik Heintare of Bolt, Lior Solomon of Vimeo, Krishna Naidu of Canva, Mike Cohen of Substack, Jens Larsson of Ark, Gunnar Tangring of Klarna, Yoav Shmaria of Similarweb and Xiaoxu Gao of Adyen.

Check out our three most downloaded episodes:
  • Zach Wilson on What Makes a Great Data Engineer
  • Joe Reis and Matt Housley on The Fundamentals of Data Engineering
  • Bill Inmon, The Godfather of Data Warehousing

The Data Engineering Show, by The Firebolt Data Bros

3.8 (8 ratings)

