
Companies competing in the chatbot wars are using something known in the industry as “the Pile” to train their large language models. It’s a trove of open-source data made up of text scraped from all around the internet, including Wikipedia and the European Parliament. Annie Gilbertson, investigative reporter for Proof News, recently took a deep dive into the Pile and discovered something else: a dataset called “YouTube Subtitles.” Marketplace’s Lily Jamali spoke with Gilbertson about her investigation and how YouTube creators feel about their content being used without their consent.