
Our guest today is Loubna Ben Allal, Machine Learning Engineer at Hugging Face 🤗.
In our conversation, Loubna first explains how she built two impressive code generation models: StarCoder and StarCoder2. We dig into the importance of data when training large models and what can be done on the data side to improve LLM performance.
We then dive into synthetic data generation and discuss its pros and cons. Loubna explains how she built Cosmopedia, a fully synthetic dataset generated using Mixtral 8x7B.
Loubna also shares career mistakes, advice, and her take on the future of developers and code generation.
If you enjoyed the episode, please leave a 5-star review and subscribe to the AI Stories YouTube channel.
Cosmopedia Dataset: https://huggingface.co/blog/cosmopedia
StarCoder blog post: https://huggingface.co/blog/starcoder
Follow Loubna on LinkedIn: https://www.linkedin.com/in/loubna-ben-allal-238690152/
Follow Neil on LinkedIn: https://www.linkedin.com/in/leiserneil/
---
(00:00) - Intro
(02:00) - How Loubna Got Into Data & AI
(03:57) - Internship at Hugging Face
(06:21) - Building A Code Generation Model: StarCoder
(12:14) - Data Filtering Techniques for LLMs
(18:44) - Training StarCoder
(21:35) - Will GenAI Replace Developers?
(25:44) - Synthetic Data Generation & Building Cosmopedia
(35:44) - Evaluating a 1B Params Model Trained on Synthetic Data
(43:43) - Challenges Faced & Career Advice
By Neil Leiser
