April 03, 2026

Scale Dependent Data Duplication

20 minutes

In this episode:
• Introduction: What is a Duplicate?: Professor Norris and Linda introduce the paper "Scale Dependent Data Duplication" and discuss the core question of what really counts as a duplicate for a language model.
• The Emergence of Semantics: Linda breaks down how larger, more capable models begin to treat semantic equivalents like translations as exact duplicates, and Norris reacts to the gradient similarity experiment.
• Semantic Collisions at Web Scale: The hosts discuss what happens when datasets grow to hundreds of billions of tokens, highlighting the surprising collapse of scaling laws for semantic diversity in synthetic data.
• Breaking and Restoring Scaling Laws: Linda explains how limited semantic uniqueness hurts larger models and breaks naive scaling extrapolations, followed by the authors' mathematical fix using an effective unique data metric.
• Conclusion: The Future of the Bitter Lesson: Norris and Linda wrap up by discussing the philosophical and practical implications for the future of AI scaling, data efficiency, and the limits of synthetic data.

By Mechanical Dirk

