AI Stories

Code Generation & Synthetic Data With Loubna Ben Allal #51



Our guest today is Loubna Ben Allal, Machine Learning Engineer at Hugging Face 🤗 .

In our conversation, Loubna first explains how she built two impressive code generation models: StarCoder and StarCoder2. We dig into the importance of data when training large models and what can be done on the data side to improve LLM performance.

We then dive into synthetic data generation and discuss its pros and cons. Loubna explains how she built Cosmopedia, a fully synthetic dataset generated using Mixtral 8x7B.

Loubna also shares career mistakes, advice and her take on the future of developers and code generation. 

If you enjoyed the episode, please leave a 5-star review and subscribe to the AI Stories YouTube channel.

Cosmopedia Dataset: https://huggingface.co/blog/cosmopedia

StarCoder blog post: https://huggingface.co/blog/starcoder

Follow Loubna on LinkedIn: https://www.linkedin.com/in/loubna-ben-allal-238690152/

Follow Neil on LinkedIn: https://www.linkedin.com/in/leiserneil/  

---

(00:00) - Intro

(02:00) - How Loubna Got Into Data & AI

(03:57) - Internship at Hugging Face

(06:21) - Building A Code Generation Model: StarCoder

(12:14) - Data Filtering Techniques for LLMs

(18:44) - Training StarCoder

(21:35) - Will GenAI Replace Developers? 

(25:44) - Synthetic Data Generation & Building Cosmopedia

(35:44) - Evaluating a 1B-Parameter Model Trained on Synthetic Data

(43:43) - Challenges Faced & Career Advice



AI Stories by Neil Leiser

