July 17, 2023

AI Fundamentals: Datasets 101

1 hour

In April, we released our first AI Fundamentals episode: Benchmarks 101. We covered the history of benchmarks, why they exist, how they are structured, and how they influence the development of artificial intelligence.

Today we are (finally!) releasing Datasets 101! We’re really enjoying doing this series despite the work it takes - please let us know what else you want us to cover!

Stop me if you’ve heard this before: “GPT3 was trained on the entire Internet”.

Blatantly, demonstrably untrue: the GPT3 dataset is a little over 600GB, drawn primarily from Wikipedia, books corpora, WebText and 2016-2019 CommonCrawl. The MacBook Air I am typing this on has more free disk space than that. In contrast, the “entire internet” is estimated to be 64 zettabytes, or 64 trillion GB. So it’s more accurate to say that GPT3 is trained on about 0.000000001% of the Internet.
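For anyone who wants to check the arithmetic, here is a minimal sketch using only the two figures quoted above. Both are rough estimates, the variable names are ours, and we assume decimal units (1 zettabyte = 10^21 bytes = 10^12 GB).

```python
# Back-of-the-envelope check of the "share of the Internet" claim.
# Both inputs are rough estimates quoted in the text, not measurements.
GPT3_DATASET_GB = 600      # "a little over 600GB"
INTERNET_ZB = 64           # "estimated to be 64 zettabytes"
GB_PER_ZB = 10**12         # assumption: decimal units, 1 ZB = 10^12 GB

internet_gb = INTERNET_ZB * GB_PER_ZB       # 64 trillion GB
fraction = GPT3_DATASET_GB / internet_gb    # share as a plain ratio
percent = fraction * 100                    # share as a percentage

print(f"Internet size: {internet_gb:.2e} GB")
print(f"GPT3 share:    {percent:.1e} %")
```

With these inputs the share lands on the order of 10^-9 percent, so the exact number of zeros matters less than the point: it is a vanishingly small slice.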

Why spend $5m on GPU time training on $50 worth of data?

Simple: garbage in, garbage out. No matter how good your algorithms are, and no matter how much money or compute you have, your model quality is strongly determined by the data you train it on, and research scientists think we just don’t need, or don’t have, that much high-quality data. We spend an enormous amount of effort throwing out data to keep the quality high, and recently Web 2.0-era UGC platforms like StackOverflow, Reddit, and Twitter have clamped down on their APIs as they realize the goldmines they sit on.

Data is the new new oil. Time for a primer!

Show Notes

* Our 2 months’ worth of podcast prep notes!

* The Token Crisis paper

* Ilya Sutskever on datasets

* OpenAI Tokenizer

* Kaplan Scaling Laws Lecture

* Chinchilla Paper

* Sasha Rush’s Tweet

* Karpathy’s Build Conference Presentation

* LIMA Paper

* Phi-1 by Microsoft

* Washington Post Article on datasets

* Our episode with Jonathan Frankle

* Our episode with Mike Conover

* BloombergGPT

* Datasets

* HuggingFace Hub

* CommonCrawl, Overview

* C4

* List of Dirty, Naughty, Obscene, and Otherwise Bad Words

* OpenWebText

* books3

* OpenAssistant

* The Stack

* The Pile

* LAION

* Audio:

* LibriSpeech: A dataset of audio recordings of audiobooks

* CommonVoice: A dataset of audio recordings of people speaking different languages

* Voxforge: A dataset of audio recordings of people speaking different languages

* Switchboard: A dataset of audio recordings of telephone conversations

* Fisher Corpus: A dataset of audio recordings of conversational telephone speech

* Chinese:

* CMRC (Chinese Machine Reading Comprehension 2018)

* DuReader

* ChID

* Copyright & Privacy:

* https://stablediffusionlitigation.com/

* https://haveibeentrained.com/

* https://githubcopilotlitigation.com/

* https://twitter.com/moyix/status/1662131770463072257

* OpenAI Opt Out Process

* Check if you’re in The Stack

* Deduplication

* Deduplicating Training Data Makes Language Models Better

* Deduplicating Training Data Mitigates Privacy Risks in Language Models

* Contamination

* CodeForces example

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe


By Latent.Space


