The paper "Decoupled Weight Decay Regularization" by Ilya Loshchilov and Frank Hutter (2017) introduced what we know in Pytorch now as AdamW. This academic paper explores the differences between L2 regularization and weight decay regularization in optimizing neural networks, particularly focusing on adaptive gradient algorithms like Adam. The authors demonstrate that while these two regularization methods are equivalent for standard stochastic gradient descent (SGD), they are not equivalent for Adam, leading to suboptimal generalization performance in common Adam implementations. They propose a decoupled weight decay method, termed AdamW, which substantially improves Adam's generalization and allows its performance to rival SGD with momentum on image classification tasks. The paper also introduces normalized weight decay and discusses the integration of warm restarts for improved performance and hyperparameter tuning.