The Information Bottleneck

EP22: Data Curation for LLMs with Cody Blakeney (Datology AI)



Cody Blakeney from Datology AI joins us to talk about data curation - the unglamorous but critical work of figuring out what to actually train models on.

Cody's path from writing CUDA kernels to spending his days staring at weird internet text tells you something important: data quality can account for half or more of a model's final performance. That's on par with major architectural breakthroughs.

We get into the differences between pre-training, mid-training, and post-training data. Mid-training in particular has become a key technique for squeezing value out of rare, high-quality datasets. Cody's team stumbled onto it while solving a practical problem: how do you figure out if a 5-billion-token dataset is actually useful when you can't afford hundreds of experimental runs?

We also talk about data filtering and some genuinely surprising findings: the documents that make the best training data are often short and dense with information. Those nicely written blog posts with personal anecdotes? Turns out models don't learn as well from them.

On synthetic data, Cody thinks its use in pre-training is still in its early days: most techniques are variations on a few core ideas, but there's huge potential. He's excited about connecting RL failures back to mid-training: when models fail at tasks, use that signal to generate targeted training data.

Takeaways:

  • Data work is high-leverage but underappreciated
  • Mid-training helps extract signal from small, valuable datasets
  • Good filters favor dense, factual text over polished prose
  • Synthetic data for pre-training works surprisingly well, but remains primitive
  • Optimal data mixtures depend on model scale: smaller models need more aggressive distribution shifts

Timeline:

    (00:12) Introduction to Data Curation in LLMs

    (05:14) The Importance of Data Quality

    (10:15) Pre-training vs Post-training Data

    (15:22) Strategies for Effective Data Utilization

    (20:15) Benchmarking and Model Evaluation

    (28:28) Maximizing Perplexity and Coherence

    (30:27) Measuring Quality in Data

    (32:56) The Role of Filters in Data Selection

    (34:19) Understanding High-Quality Data

    (39:15) Mid-Training and Its Importance

    (46:51) Future of Data Sources

    (48:13) Synthetic Data's Role in Pre-Training

    (53:10) Creating Effective Synthetic Data

    (57:39) The Debate on Pure Synthetic Data

    (01:00:25) Navigating AI Training and Legal Challenges

    (01:02:34) The Controversy of AI in the Art Community

    (01:05:29) Exploring Synthetic Data and Its Efficiency

    (01:11:21) The Future of Domain-Specific vs. General Models

    (01:22:06) Bias in Pre-trained Models and Data Selection

    (01:28:27) The Potential of Synthetic Data Over Human Data

Music:

  • "Kid Kodi" — Blue Dot Sessions — via Free Music Archive — CC BY-NC 4.0.
  • "Palms Down" — Blue Dot Sessions — via Free Music Archive — CC BY-NC 4.0.

Changes: trimmed

About

The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.


