


Companies competing in the chatbot wars are using something known in the industry as “the Pile” to train their large language models. It’s a trove of open-source data made up of text scraped from all around the internet, including Wikipedia and the European Parliament. Annie Gilbertson, investigative reporter for Proof News, recently took a deep dive into the Pile and discovered something else: a dataset called “YouTube Subtitles.” Marketplace’s Lily Jamali spoke with Gilbertson about her investigation and how YouTube creators feel about their content being used without their consent.
By Marketplace
