Mechanical Dreams

Using Scaling Laws for Data Source Utility Estimation in Domain-Specific Pre-Training



In this episode:
• Can a Jack-of-All-Trades Learn to Be a Doctor?: Linda introduces the challenge of specializing large language models for specific domains and presents a new paper that proposes a smarter way to pick the right training data.
• The Deceptive First Sip: The hosts discuss why the common practice of 'micro-annealing'—using small-scale tests to evaluate data sources—can be misleading, as the best data for a short run may not be the best for a long one.
• Plotting the Curve, Not Just the Point: Linda explains the paper's core proposal: instead of relying on a single test, estimate a scaling law for each data source by running multiple experiments to predict its utility at scale.
• The Tortoise and the Hare of Data: Professor Norris and Linda dive into the paper's key experiment, revealing how synthetic data (the hare) starts fast but is overtaken by more diverse, filtered data (the tortoise) as compute increases.
• Scaling Smartly: The Takeaway: The hosts conclude by emphasizing the practical importance of scaling-aware data selection to avoid wasting significant compute and money on suboptimal data strategies.
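The idea the hosts describe can be sketched in a few lines of code: fit a simple power law to each data source's small-scale results, then extrapolate to a larger compute budget before committing. This is a minimal illustration of the general technique, not the paper's actual procedure; the loss numbers, source names, and the single-term power-law form `L(C) = a * C^(-b)` are all assumptions made for the example.

```python
import math

# Hypothetical validation losses measured at small compute budgets (FLOPs)
# for two candidate data sources -- illustrative numbers, not from the paper.
measurements = {
    "synthetic":        [(1e17, 3.10), (2e17, 3.02), (4e17, 2.96), (8e17, 2.91)],
    "filtered_diverse": [(1e17, 3.25), (2e17, 3.10), (4e17, 2.97), (8e17, 2.85)],
}

def fit_power_law(points):
    """Least-squares fit of L(C) = a * C^(-b), done as a line in log-log space."""
    xs = [math.log(c) for c, _ in points]
    ys = [math.log(loss) for _, loss in points]
    n = len(points)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    return math.exp(intercept), -slope  # (a, b)

def predict(a, b, compute):
    """Extrapolated loss at a given compute budget."""
    return a * compute ** (-b)

fits = {name: fit_power_law(pts) for name, pts in measurements.items()}

# At each budget, pick the source whose *extrapolated* loss is lowest.
for budget in (1e17, 1e19, 1e21):
    best = min(fits, key=lambda s: predict(*fits[s], budget))
    print(f"{budget:.0e} FLOPs -> best source: {best}")
```

In this toy setup the synthetic source wins at the smallest budget but the filtered, diverse source has a steeper scaling exponent, so the extrapolation flips the ranking at larger budgets, which is exactly the "tortoise and the hare" dynamic discussed in the episode.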

Mechanical Dreams, by Mechanical Dirk