The Nonlinear Library

AF - Gradient surfing: the hidden role of regularization by Jesse Hoogland



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Gradient surfing: the hidden role of regularization, published by Jesse Hoogland on February 6, 2023 on The AI Alignment Forum.
Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort
In a previous post, I demonstrated that Brownian motion near singularities defies our expectations from "regular" physics. Singularities trap random motion and take up more of the equilibrium distribution than you'd expect from the Gibbs measure.
In the computational probability community, this is a well-known pathology. Sampling techniques like Hamiltonian Monte Carlo get stuck in corners, and this is something to avoid. You typically don't want biased estimates of the distribution you're trying to sample.
In deep learning, I argued, this behavior might be less a bug than a feature.
The claim of singular learning theory is that models near singularities have lower effective dimensionality. From Occam's razor, we know that simpler models generalize better, so if the dynamics of SGD get stuck at singularities, it would suggest an explanation (at least in part) for why SGD works: the geometry of the loss landscape biases your optimizer towards good solutions.
This is not a particularly novel claim. Similar versions of the claim have been made before by Mingard et al. and Valle Pérez et al. But from what I can tell, the proposed mechanism of singularity "stickiness" is quite different.
Moreover, it offers a new possible explanation for the role of regularization. If exploring the set of points with minimum training loss is enough to get to generalization, then perhaps the role of the regularizer is not just to privilege "simpler" functions but also to make exploration possible.
In the absence of regularization, SGD can't easily move between points of equal loss. When it reaches the bottom of a valley, it's pretty much stuck. Adding a term like weight decay breaks this invariance. It frees the neural network to surf the loss basin, so it can accidentally stumble across better generalizing solutions.
So could we improve generalization by exploring the bottom of the loss basin in other ways — without regularization or even without SGD? Could we, for example, get a model to grok through random drift?
No. We can't.
That is to say, I haven't succeeded yet. Still, in the spirit of "null results are results", let me share the toy model that motivated this hypothesis and the experiments that have (as of yet) failed to confirm it.
The inspiration: a toy model
First, let's take a look at the model that inspired the hypothesis.
Let's begin by modifying the example of the previous post to include an optional regularization term controlled by λ:
We deliberately center the regularization away from the origin at c=(−1,−1) so it doesn't already privilege the singularity at the origin.
Now, instead of viewing U(x) as a potential and exploring it with Brownian motion, we'll treat it as a loss function and use stochastic gradient descent to optimize for x.
We'll start our optimizer at a uniformly sampled random point in this region and take T=100 steps down the gradient (with optional momentum controlled by β). After each gradient step, we'll inject a bit of Gaussian noise to simulate the "stochasticity." Altogether, the update rule for x is as follows:
with momentum updated according to:
and noise given by,
If we sample the final position x(T) over many independent initializations, then, in the absence of regularization and in the presence of a small noise term, we'll get a distribution that looks like the figure on the left.
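To make the procedure concrete, here is a minimal sketch of the experiment in Python. The formulas from the post did not survive in this copy, so several pieces are assumptions rather than the original definitions: the loss U(x) = (x0·x1)² + λ‖x − c‖² is a stand-in with a singular minimum set crossing at the origin, the update is the standard heavy-ball form (m ← βm + ∇U, x ← x − ηm + ε with ε drawn from an isotropic Gaussian), and the names `loss`, `grad_loss`, `run_noisy_sgd`, `lr`, `sigma`, and `box` are hypothetical. Only c = (−1, −1), T = 100, λ, and β come from the text.

```python
import numpy as np

C = np.array([-1.0, -1.0])  # regularization center c = (-1, -1), as in the text


def loss(x, lam=0.0):
    # ASSUMED stand-in loss (not the post's formula):
    # U(x) = (x0 * x1)^2 + lam * ||x - C||^2
    x0, x1 = x[..., 0], x[..., 1]
    return (x0 * x1) ** 2 + lam * np.sum((x - C) ** 2, axis=-1)


def grad_loss(x, lam=0.0):
    # Analytic gradient of the assumed loss above.
    x0, x1 = x[..., 0], x[..., 1]
    g = np.stack([2 * x0 * x1**2, 2 * x0**2 * x1], axis=-1)
    return g + 2 * lam * (x - C)


def run_noisy_sgd(n_runs=1000, T=100, lr=0.01, beta=0.0,
                  lam=0.0, sigma=0.01, box=2.0, seed=0):
    """Return the final positions x(T) for n_runs independent initializations."""
    rng = np.random.default_rng(seed)
    # Uniform initialization in an ASSUMED square region [-box, box]^2.
    x = rng.uniform(-box, box, size=(n_runs, 2))
    m = np.zeros_like(x)
    for _ in range(T):
        m = beta * m + grad_loss(x, lam)                   # momentum update
        x = x - lr * m + sigma * rng.normal(size=x.shape)  # gradient step plus noise
    return x


if __name__ == "__main__":
    # Compare how much mass ends up near the origin in three settings.
    for name, kwargs in [("no regularization, small noise", {}),
                         ("with regularization", {"lam": 0.1}),
                         ("larger noise", {"sigma": 0.1})]:
        x_final = run_noisy_sgd(**kwargs)
        near_origin = np.mean(np.linalg.norm(x_final, axis=-1) < 0.25)
        print(f"{name}: fraction within 0.25 of the origin = {near_origin:.3f}")
```

A 2D histogram of the returned points plays the role of the figures referred to here. Because the loss is a stand-in, the numbers this prints should not be read as reproducing the post's results.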
Unlike the case of random motion, the singularity at the origin is now repulsive. Good luck finding those simple solutions now.
However, as soon as we turn on the regularization (middle figure) or increase the noise term (figure on the right), the singulari...
The Nonlinear Library, by The Nonlinear Fund
