Papers Read on AI

Deduplicating Training Data Makes Language Models Better


We find that existing language modeling datasets contain many near-duplicate examples and long repetitive substrings. Deduplication allows us to train models that emit memorized text ten times less frequently and require fewer training steps to achieve the same or better accuracy. We release code for reproducing our work and performing dataset deduplication at https://github.com/google-research/deduplicate-text-datasets.
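
The abstract names two forms of duplication: near-duplicate examples and long repetitive substrings. Below is a minimal sketch of the first idea, flagging near-duplicate documents with MinHash signatures over word n-grams; the signature length, n-gram size, similarity threshold, and sample documents are illustrative assumptions, not the paper's exact configuration.

import random

NUM_HASHES = 128   # signature length; illustrative, not the paper's setting
NGRAM = 5          # word n-gram ("shingle") size; illustrative
THRESHOLD = 0.8    # estimated Jaccard similarity treated as "near-duplicate"

random.seed(0)
PRIME = (1 << 61) - 1  # large prime for the hash family h(x) = (a*x + b) % PRIME
HASHES = [(random.randrange(1, PRIME), random.randrange(PRIME))
          for _ in range(NUM_HASHES)]

def shingles(text):
    # Represent a document as the set of hashed word n-grams it contains.
    words = text.split()
    return {hash(" ".join(words[i:i + NGRAM]))
            for i in range(max(1, len(words) - NGRAM + 1))}

def minhash(text):
    # MinHash signature: the minimum of each hash function over the shingle set.
    sh = shingles(text)
    return [min((a * s + b) % PRIME for s in sh) for a, b in HASHES]

def estimated_jaccard(sig_a, sig_b):
    # The fraction of positions where two signatures agree is an unbiased
    # estimate of the Jaccard similarity of the underlying shingle sets.
    return sum(x == y for x, y in zip(sig_a, sig_b)) / NUM_HASHES

docs = [
    "the quick brown fox jumps over the lazy dog near the river bank",
    "the quick brown fox jumps over the lazy dog near the river bank today",
    "an entirely different sentence about language model training data",
]
sigs = [minhash(d) for d in docs]
for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        sim = estimated_jaccard(sigs[i], sigs[j])
        if sim >= THRESHOLD:
            print(f"docs {i} and {j} look near-duplicate (est. Jaccard {sim:.2f})")

At corpus scale, comparing every pair as in this toy loop is infeasible; signatures are typically bucketed with locality-sensitive hashing so that only likely matches are ever compared.
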
2021: Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, Nicholas Carlini
Keywords: NLP, Language
https://arxiv.org/pdf/2107.06499v1.pdf
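
For the long repetitive substrings, the paper's released code finds spans that occur verbatim more than once using a suffix array. The toy sketch below shows the core idea on characters: sort all suffix start positions, and long common prefixes between neighboring suffixes reveal repeated spans. The 20-character threshold, naive O(n^2 log n) sort, and sample corpus are illustrative assumptions; the paper operates on BPE tokens with a 50-token threshold and an efficient suffix-array construction.

MIN_LEN = 20  # illustrative threshold; the paper uses 50 BPE tokens

def common_prefix_len(corpus, i, j):
    # Length of the longest common prefix of the suffixes starting at i and j.
    n, k = len(corpus), 0
    while i + k < n and j + k < n and corpus[i + k] == corpus[j + k]:
        k += 1
    return k

def repeated_substrings(corpus, min_len=MIN_LEN):
    # Naive suffix array: every suffix start position, sorted lexicographically.
    # Any substring that occurs twice shows up as a shared prefix of two
    # suffixes that land next to each other in this order.
    sa = sorted(range(len(corpus)), key=lambda i: corpus[i:])
    found = set()
    for a, b in zip(sa, sa[1:]):
        lcp = common_prefix_len(corpus, a, b)
        if lcp >= min_len:
            found.add(corpus[a:a + lcp])
    return found

corpus = ("a long repetitive passage that appears twice verbatim -- "
          "a long repetitive passage that appears twice verbatim -- "
          "plus some unique trailing text")
longest = max(repeated_substrings(corpus), key=len)
print(f"repeated span ({len(longest)} chars): {longest!r}")
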

Papers Read on AI, by Rob
