The Nonlinear Library

AF - Degeneracies are sticky for SGD by Guillaume Corlouer



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Degeneracies are sticky for SGD, published by Guillaume Corlouer on June 16, 2024 on The AI Alignment Forum.
Introduction
Singular learning theory (SLT) is a theory of learning dynamics in Bayesian statistical models. It has been argued that SLT could provide insights into the training dynamics of deep neural networks. However, a theory of deep learning inspired by SLT is still lacking. In particular, it seems important to have a better understanding of the relevance of SLT insights to stochastic gradient descent (SGD), the paradigmatic optimization algorithm of deep learning.
We explore how the degeneracies[1] of toy, low-dimensional loss landscapes affect the dynamics of SGD.[2] We also investigate the hypothesis that the set of parameters selected by SGD, after a large number of gradient steps on a degenerate landscape, is distributed like the Bayesian posterior at low temperature (i.e., in the large-sample limit). We do so by running SGD on 1D and 2D loss landscapes with minima of varying degrees of degeneracy.
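To give a concrete sense of this kind of experiment, here is a minimal sketch (our illustration, not the authors' code) that runs minibatch SGD on a one-parameter model f(w, x) = w^k * x with zero targets, so the theoretical loss is proportional to w^(2k) and larger k gives a more degenerate minimum at w = 0. The model, hyperparameters, and sample sizes are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_sgd(k, n=10_000, batch=32, lr=1e-2, steps=20_000, w0=1.0):
    """Minibatch SGD on the toy model f(w, x) = w**k * x with targets y = 0,
    so the theoretical loss is proportional to w**(2*k)."""
    x = rng.standard_normal(n)            # fixed training inputs; all targets are zero
    w = w0
    for _ in range(steps):
        xb = rng.choice(x, size=batch)    # sample a minibatch
        grad = np.mean(k * w ** (2 * k - 1) * xb ** 2)   # batch-averaged d/dw of (1/2)(w**k * x)**2
        w -= lr * grad
    return w

# k = 1 is a non-degenerate minimum at w = 0; k = 2, 3 are increasingly degenerate.
for k in (1, 2, 3):
    print(f"k={k}: |w| after training = {abs(run_sgd(k)):.3e}")
```

With these (arbitrary) settings, the non-degenerate case k = 1 should reach the minimum up to numerical precision, while k = 2 and k = 3 should remain much farther away after the same number of steps.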
While researchers experienced with SLT are aware of differences between SGD and Bayesian inference, we want to understand the influence of degeneracies on SGD with more precision and have specific examples where SGD dynamics and Bayesian inference can differ.
Main takeaways
- Degeneracies influence SGD dynamics in two ways: (1) convergence to a critical point is slower the more degenerate the critical point is; (2) on a (partially) degenerate manifold, SGD preferentially escapes along non-degenerate directions. If all directions are degenerate, then we empirically observe that SGD is "stuck".
- To explain our observations, we show that, for our models, the SGD noise covariance is proportional to the Hessian in the neighborhood of a critical point of the loss. Thus the SGD noise covariance goes to zero faster along more degenerate directions, to leading order in the neighborhood of a critical point (see the numerical sketch after this list).
- Qualitatively, we observe that the concentration of the end-of-training distribution of parameters sampled from a set of SGD trajectories sometimes differs from the Bayesian posterior as predicted by SLT, because of:
  - hyperparameters such as the learning rate,
  - the number of orthogonal degenerate directions,
  - the degree of degeneracy in the neighborhood of a minimum.
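As a quick numerical illustration of the noise claim (a sketch with a toy model of our choosing, not the post's models or derivation), consider f(w, x) = w1^k1 * x1 + w2^k2 * x2 with zero targets, whose theoretical loss is (w1^(2k1) + w2^(2k2))/2. Estimating the per-sample gradient variance as the parameter approaches the minimum shows the noise shrinking much faster along the degenerate direction:

```python
import numpy as np

rng = np.random.default_rng(0)
k1, k2 = 2, 1   # the w1 direction is degenerate (quartic), the w2 direction is not (quadratic)

def noise_variance(w, n=200_000):
    """Estimate the per-sample SGD gradient noise variance along each axis at parameter w."""
    x = rng.standard_normal((n, 2))
    resid = w[0] ** k1 * x[:, 0] + w[1] ** k2 * x[:, 1]            # f(w, x_i) - y_i, with y_i = 0
    grads = np.stack([resid * k1 * w[0] ** (k1 - 1) * x[:, 0],     # per-sample gradient, w1 coordinate
                      resid * k2 * w[1] ** (k2 - 1) * x[:, 1]],    # per-sample gradient, w2 coordinate
                     axis=1)
    return grads.var(axis=0)

for r in (0.3, 0.1, 0.03):   # approach the critical point at the origin along the diagonal
    print(f"|w - w*| ~ {r}: noise variance along (w1, w2) =", noise_variance(np.array([r, r])))
```

In this toy example the estimated noise variance falls off roughly like |w - w*|^4 along the degenerate w1 direction, but only like |w - w*|^2 along the non-degenerate w2 direction.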
Terminology and notation
We advise the reader to skip this section and come back to it if notation or terminology is confusing.
Consider a sequence of n input-output pairs (x_i, y_i), 1 ≤ i ≤ n. We can think of x_i as input data to a deep learning model (e.g., a picture, or a token) and y_i as an output the model is trying to learn (e.g., whether the picture represents a cat or a dog, or what the next token is). A deep learning model may be represented as a function y = f(w, x), where w ∈ Ω is a point in a parameter space Ω.
The one-sample loss function, written l_i(w) := (1/2)(y_i − f(w, x_i))² for 1 ≤ i ≤ n, is a measure of how good the model parametrized by w is at predicting the output y_i on input x_i. The empirical loss over n samples is written l_n(w) := (1/n) Σ_{i=1}^n l_i(w). Writing q(x, y) for the probability density function of input-output pairs, the theoretical loss (or potential) is l(w) := E_q[l_n(w)].[4] The loss landscape is the manifold associated with the theoretical loss function w ↦ l(w).
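To make the notation concrete, here is a small sketch implementing these definitions for an arbitrary toy model (the model and data here are ours, purely for illustration):

```python
import numpy as np

def f(w, x):
    """A toy one-parameter model; any differentiable model could play this role."""
    return w[0] * x

def one_sample_loss(w, x_i, y_i):
    """l_i(w) := (1/2) * (y_i - f(w, x_i))**2"""
    return 0.5 * (y_i - f(w, x_i)) ** 2

def empirical_loss(w, xs, ys):
    """l_n(w) := (1/n) * sum_i l_i(w)"""
    return np.mean([one_sample_loss(w, x, y) for x, y in zip(xs, ys)])

rng = np.random.default_rng(0)
xs = rng.standard_normal(100)
ys = 0.5 * xs                                    # data generated with "true" parameter 0.5
print(empirical_loss(np.array([0.4]), xs, ys))   # the theoretical loss l(w) is the expectation of l_n(w) over q
```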
A point w* is a critical point if the gradient of the theoretical loss is 0 at w*, i.e. ∇l(w*) = 0. A critical point w* is degenerate if the Hessian of the loss H(w*) := ∇²l(w*) has at least one zero eigenvalue at w*. An eigenvector u of H with zero eigenvalue is a degenerate direction.
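For example (an illustrative potential of our choosing, not necessarily one from the post), the potential l(w) = w1⁴ + w2² has a critical point at the origin; a finite-difference Hessian check shows one zero eigenvalue, whose eigenvector (the w1 axis) is a degenerate direction:

```python
import numpy as np

def loss(w):
    # Illustrative potential with a critical point at the origin:
    # degenerate (quartic) along w1, non-degenerate (quadratic) along w2.
    return w[0] ** 4 + w[1] ** 2

def hessian(fn, w, eps=1e-4):
    """Finite-difference Hessian of fn at w."""
    d = len(w)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            e_i, e_j = np.eye(d)[i] * eps, np.eye(d)[j] * eps
            H[i, j] = (fn(w + e_i + e_j) - fn(w + e_i - e_j)
                       - fn(w - e_i + e_j) + fn(w - e_i - e_j)) / (4 * eps ** 2)
    return H

H = hessian(loss, np.zeros(2))
eigvals, eigvecs = np.linalg.eigh(H)
print("Hessian eigenvalues at the origin:", eigvals)                     # one (near-)zero eigenvalue
print("degenerate direction:", eigvecs[:, np.argmin(np.abs(eigvals))])   # approximately the w1 axis
```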
The local learning coefficient λ(w*) measures the greatest amount of degeneracy of a model around a critical point w*. For the purpose of this work, if locally l(w = (w1, w2)) ∝ (w1 − w1*)^(2k1) (w2 − w2*)^(2k2), then the local learning coefficient is given by λ(w*) = min(1/(2k1), 1/(2k2)). We say that a critical point...