
Our guest today is Loubna Ben Allal, Machine Learning Engineer at Hugging Face 🤗.
In our conversation, Loubna first explains how she built two impressive code generation models: StarCoder and StarCoder2. We dig into the importance of data when training large models and what can be done on the data side to improve LLM performance.
We then dive into synthetic data generation and discuss its pros and cons. Loubna explains how she built Cosmopedia, a fully synthetic dataset generated using Mixtral 8x7B.
Loubna also shares career mistakes, advice, and her take on the future of developers and code generation.
If you enjoyed the episode, please leave a 5-star review and subscribe to the AI Stories YouTube channel.
Cosmopedia Dataset: https://huggingface.co/blog/cosmopedia
StarCoder blog post: https://huggingface.co/blog/starcoder
Follow Loubna on LinkedIn: https://www.linkedin.com/in/loubna-ben-allal-238690152/
Follow Neil on LinkedIn: https://www.linkedin.com/in/leiserneil/
---
(00:00) - Intro
(02:00) - How Loubna Got Into Data & AI
(03:57) - Internship at Hugging Face
(06:21) - Building A Code Generation Model: StarCoder
(12:14) - Data Filtering Techniques for LLMs
(18:44) - Training StarCoder
(21:35) - Will GenAI Replace Developers?
(25:44) - Synthetic Data Generation & Building Cosmopedia
(35:44) - Evaluating a 1B Params Model Trained on Synthetic Data
(43:43) - Challenges Faced & Career Advice
By Neil Leiser
