Interconnects

Frontiers in synthetic data



Synthetic data is known to be a powerful tool at every level of the language modeling stack. It is documented as being used to expand vanilla pretraining data and to create large swaths of fine-tuning data. Many more rumors surround its use: Anthropic's pretraining-scale constitutional AI, Mistral AI's first models being pretrained on OpenAI outputs, Q-star's hopes as OpenAI's remaining moat, and much more. The diversity of use cases for synthetic data makes it difficult to plan around the role of synthetic data in solving specific goals.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/frontiers-in-synthetic-data

00:00 Frontiers in synthetic data
01:14 1. Direct distillation is still king
02:54 2. Are Gemini Flash and Claude Haiku distilled?
04:03 3. Filtering prevents collapse
06:30 4. Synthetic data strategy taxes
07:32 5. Pros and cons of training on multi-output-source synthetic datasets
08:54 6. Structured synthetic data
09:42 7. Weak-to-strong generalization is maybe real
10:27 8. Creating synthetic prompts is overlooked again



This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Interconnects, by Nathan Lambert


