Thoughts on Loss Landscapes and why Deep Learning works, published by beren on July 25, 2023 on LessWrong.
Crossposted from my personal blog.
Epistemic status: Pretty uncertain. I don't have an expert-level understanding of current views in the science of deep learning about why optimization works; I just read papers as an amateur. Some of the arguments I present here might already be known or disproven. If so, please let me know!
There are essentially two fundamental questions in the science of deep learning: 1.) Why are models trainable? And 2.) Why do models generalize? The answer to both questions relates to the nature and basic geometry of the loss landscape, which is ultimately determined by the computational architecture of the model. Here I present my personal, fairly idiosyncratic and speculative take, along with what I think are fairly novel answers to both questions. Let's get started.
First let's think about the first question: Why are deep learning models trainable at all? A priori, they shouldn't be. Deep neural networks are fundamentally solving incredibly high-dimensional problems with extremely non-convex loss landscapes. The entirety of optimization theory should tell you that, in general, this isn't possible. You should get stuck in exponentially many local minima before getting close to any decent optimum, and you should probably also just explode, because you are using SGD, the stupidest possible optimizer.
So why does it, in fact, work? The fundamental reason is that we are very deliberately not in the general case. Much of the implicit work of deep learning is fiddling with architectures, hyperparameters, initializations, and so forth to ensure that, in the specific case, the loss landscape is benign. To even get SGD working in the first place, there are two fundamental problems which neural networks solve:
First, vanilla SGD assumes unit variance. This can be shown straightforwardly from a simple Bayesian interpretation of SGD and is also why second-order optimizers like natural gradients are useful. In effect, SGD fails whenever the variance and curvature of the loss landscape are either too high (SGD will explode) or too low (SGD will grind to a halt and take forever to traverse the landscape). This problem has many names, such as 'vanishing/exploding gradients', 'poor conditioning', and 'poor initialization'. Many of the tricks used to get neural networks to work address this by initializing SGD in, and forcing it to stay in, the regime of unit variance (a toy sketch of the conditioning problem follows the list below). Examples of this include:
a.) Careful scaling of initializations by width to ensure that the post-activation distribution is as close to a unit-variance Gaussian as possible (see the initialization sketch after this list).
b.) Careful scaling of initialization by depth to ensure that the network stays close to an identity function at initialization and variance is preserved (around 1). This is necessary to prevent either exploding activations with depth or vanishing activations (rank collapse).
c.) Explicit normalization at every layer (layer norm, batch norm) to force activations to a 0-mean, unit-variance Gaussian, which then shapes the gradient distribution.
d.) Gradient clipping to prevent excessively large updates, which temporarily cuts off the effects of bad conditioning.
e.) Weight decay, which keeps the weights close to a 0-mean, unit-variance Gaussian and hence keeps SGD in the expected range.
f.) Careful architecture design. We design neural networks to be as linear as possible while still having enough nonlinearities to represent useful functions. Innovations like residual connections allow for greater depth without exploding or vanishing as easily.
g.) The Adam optimizer performs a highly approximate (but cheap) per-parameter rescaling of the gradients by their empirical variance, thus approximating a diagonal second-order correction that keeps updates in the unit-variance regime (a stripped-down sketch of the Adam update follows below).
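To make the conditioning point above concrete, here is a toy numpy-free sketch (my own illustration, not from the original post) of gradient descent on a one-dimensional quadratic loss. With a fixed learning rate, the same optimizer explodes when curvature is too high and barely moves when it is too low; only the well-conditioned, roughly unit-curvature case trains nicely. The function name and constants are hypothetical choices for illustration.

```python
# Toy illustration: gradient descent on L(w) = 0.5 * h * w^2, with curvature h.
# The update w <- w - lr * h * w contracts only if |1 - lr * h| < 1, i.e. lr < 2 / h.
# Too much curvature explodes; too little crawls -- at the same learning rate.
def gd_on_quadratic(curvature, lr=0.1, steps=50, w0=1.0):
    w = w0
    for _ in range(steps):
        w -= lr * curvature * w   # gradient of 0.5 * h * w^2 is h * w
    return w

for h in [0.01, 1.0, 100.0]:
    print(f"curvature={h:>6}: final |w| = {abs(gd_on_quadratic(h)):.3e}")
# curvature=0.01 barely moves, curvature=1.0 converges, curvature=100.0 blows up.
```

This is the simplest possible stand-in for a loss landscape, but it captures why SGD needs the curvature/variance to sit in a narrow "unit" band.

Item (a) in the list can be checked with an equally small sketch (again my own illustration, with arbitrary width and depth): drawing weights with variance 1/fan_in keeps activation scale roughly constant through a stack of linear layers, while unscaled weights blow up exponentially with depth.

```python
import numpy as np

# Toy check of fan-in scaling: with W ~ N(0, 1/fan_in) the activation scale
# stays near 1 across layers; with W ~ N(0, 1) the variance grows by roughly
# a factor of fan_in per layer.
rng = np.random.default_rng(0)
width, depth = 512, 20
x = rng.standard_normal(width)

for scale_by_fan_in in [True, False]:
    h = x.copy()
    for _ in range(depth):
        std = 1.0 / np.sqrt(width) if scale_by_fan_in else 1.0
        W = rng.standard_normal((width, width)) * std
        h = W @ h                  # linear layer; nonlinearity omitted for simplicity
    print(f"scaled={scale_by_fan_in}: activation std after {depth} layers = {h.std():.3e}")
# Scaled init stays O(1); unscaled init grows exponentially with depth.
```

Finally, for item (g), here is a stripped-down version of the standard Adam update, simplified to a single function to show the key mechanism: each parameter's update is divided elementwise by the square root of a running estimate of its squared gradient, a cheap diagonal stand-in for the curvature-aware rescaling discussed above. The function signature is my own simplification, not any particular library's API.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for parameters w given gradient grad at step t (t >= 1)."""
    m = b1 * m + (1 - b1) * grad          # running mean of gradients
    v = b2 * v + (1 - b2) * grad ** 2     # running mean of squared gradients
    m_hat = m / (1 - b1 ** t)             # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter rescaling
    return w, m, v
```

Calling this in a loop with t = 1, 2, ... reproduces the textbook update; the important part for the argument here is the elementwise division by sqrt(v_hat), which keeps each parameter's effective step size near the same scale regardless of how large or small its raw gradients are.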
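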
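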