
Full show notes and a transcript are available on 96layers.ai.
This week I spoke to Stefan Baack from the Mozilla Foundation about a recent research article he authored on Common Crawl. Common Crawl is the name of both a non-profit open-data organization founded in 2008 by Gil Elbaz and the dataset it produces. Common Crawl is one of the most important datasets in the generative AI ecosystem and has been used to train dozens of large language models.
To give a sense of just how large Common Crawl is: every month it collects 3 to 5 billion webpages, 500 times more webpages than all of the articles on Wikipedia. Each of these monthly datasets is around 90 terabytes, 4,000 times as large as all of the text on Wikipedia. Over its 17-year history, Common Crawl has collected more than 250 billion webpages.
Stefan is a researcher and data analyst on the Mozilla Foundation’s Insights team. He completed his PhD at the Research Centre for Media and Journalism Studies at the University of Groningen, where he wrote a dissertation about the relationship between data journalism and civic tech.
Stefan and I spoke about how Common Crawl decides what webpages to collect, about its founder Gil Elbaz and his philosophy of building neutral data companies, about how AI builders utilize and filter Common Crawl, and about how pre-training influences large language model behavior and biases.