


Companies competing in the chatbot wars are using something known in the industry as “the Pile” to train their large language models. It’s a trove of open-source data made up of text scraped from all around the internet, including Wikipedia and the European Parliament. Annie Gilbertson, investigative reporter for Proof News, recently took a deep dive into the Pile and discovered something else: a dataset called “YouTube Subtitles.” Marketplace’s Lily Jamali spoke with Gilbertson about her investigation and how YouTube creators feel about their content being used without their consent.
By Marketplace
4.5 (1,256 ratings)
