


Companies competing in the chatbot wars are using something known in the industry as “the Pile” to train their large language models. It’s a trove of open-source data made up of text scraped from all around the internet, including Wikipedia and the European Parliament. Annie Gilbertson, investigative reporter for Proof News, recently took a deep dive into the Pile and discovered something else: a dataset called “YouTube Subtitles.” Marketplace’s Lily Jamali spoke with Gilbertson about her investigation and how YouTube creators feel about their content being used without their consent.
By Marketplace
4.5 (1,250 ratings)
