
Companies competing in the chatbot wars are using something known in the industry as “the Pile” to train their large language models. It’s a trove of open-source data made up of text scraped from all around the internet, including Wikipedia and the European Parliament. Annie Gilbertson, investigative reporter for Proof News, recently took a deep dive into the Pile and discovered something else: a dataset called “YouTube Subtitles.” Marketplace’s Lily Jamali spoke with Gilbertson about her investigation and how YouTube creators feel about their content being used without their consent.