


Companies competing in the chatbot wars are using something known in the industry as “the Pile” to train their large language models. It’s a trove of open-source data made up of text scraped from all around the internet, including Wikipedia and the European Parliament. Annie Gilbertson, investigative reporter for Proof News, recently took a deep dive into the Pile and discovered something else: a dataset called “YouTube Subtitles.” Marketplace’s Lily Jamali spoke with Gilbertson about her investigation and how YouTube creators feel about their content being used without their consent.
By Marketplace
