Recent advances in deep learning argue for the value of large datasets and large models, which necessitates the ability to scale out model training to more computational resources. Data parallelism has emerged as a popular solution for distributed training thanks to its straightforward principle and broad applicability.
Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, Soumith Chintala (2020)
Keywords: Data parallelism, Deep learning, Gradient, Computation, Computational resource, Scalability, Parallel computing, Iteration, Graphics processing unit (GPU)
https://arxiv.org/pdf/2006.15704v1.pdf
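To make the data-parallel setup described above concrete, here is a minimal sketch of distributed data-parallel training with PyTorch's `torch.nn.parallel.DistributedDataParallel`, the module the linked paper studies. The model, batch sizes, and training loop below are illustrative placeholders, and the script assumes it is launched by a tool such as `torchrun`, which sets the `LOCAL_RANK` and rendezvous environment variables.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # One process per GPU; rank and world size come from the launcher's
    # environment variables (e.g. set by torchrun).
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # A small illustrative model; real workloads would use a larger network.
    model = nn.Linear(1024, 10).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(10):
        # Each rank draws its own shard of data; DDP averages gradients
        # across ranks during backward() so the replicas stay in sync.
        inputs = torch.randn(32, 1024, device=local_rank)
        labels = torch.randint(0, 10, (32,), device=local_rank)

        optimizer.zero_grad()
        loss = loss_fn(ddp_model(inputs), labels)
        loss.backward()  # gradient all-reduce overlaps with backward computation
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

A run on a single machine with N GPUs would look like `torchrun --nproc_per_node=N train.py`; each process trains on its own data shard while DDP keeps the model replicas synchronized through gradient averaging.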