In this episode:
• Introduction: What is a Duplicate?: Professor Norris and Linda introduce the paper "Scale Dependent Data Duplication" and discuss the core question of what really counts as a duplicate for a language model.
• The Emergence of Semantics: Linda breaks down how larger, more capable models begin to treat semantic equivalents, such as translations, as exact duplicates, and Norris reacts to the gradient similarity experiment.
• Semantic Collisions at Web Scale: The hosts discuss what happens when datasets grow to hundreds of billions of tokens, highlighting the surprising collapse of scaling laws for semantic diversity in synthetic data.
• Breaking and Restoring Scaling Laws: Linda explains how limited semantic uniqueness hurts larger models and breaks naive scaling extrapolations, followed by the authors' mathematical fix using an effective unique data metric.
• Conclusion: The Future of the Bitter Lesson: Norris and Linda wrap up by discussing the philosophical and practical implications for the future of AI scaling, data efficiency, and the limits of synthetic data.