Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Hessian rank bounds the learning coefficient, published by Lucius Bushnaq on August 9, 2024 on LessWrong.
TL;DR: In a neural network with d parameters, the (local) learning coefficient λ can be upper and lower bounded in terms of the rank of the network's Hessian, $d_1$:
$$\frac{d_1}{2} \;\le\; \lambda \;\le\; \frac{d_1}{2} + \frac{d - d_1}{3}.$$
The lower bound is a known result. The upper bound is a claim by me, and this post contains the proof for it.[1] If you find any problems, do point them out.
Introduction
The learning coefficient λ is a measure of loss basin volume and network complexity. You can think of it sort of like an effective parameter count of the model. Simpler models that do less stuff have smaller λ.
Calculating λ for real networks people actually use is a pain. My hope is that these bounds help make estimating it a bit easier.
In a network with d parameters, the learning coefficient is always a number
$$0 \;\le\; \lambda \;\le\; \frac{d}{2}.$$
An existing result in the literature says that if you've calculated the rank of the network's Hessian, $d_1$,[2] you get a tighter lower bound
$$\frac{d_1}{2} \;\le\; \lambda.$$
I claim that we can also get a tighter upper bound
$$\lambda \;\le\; \frac{d_1}{2} + \frac{d - d_1}{3},$$
where $d - d_1$ is the dimension of the Hessian kernel, meaning the number of zero eigenvalues it has.[3]
This is neat because it means we can get some idea of how large λ is just with linear algebra. All we need to know is how many zero eigenvalues the Hessian has.[4] Singular Learning Theory introductions often stress that just looking at the Hessian isn't enough to measure basin volume correctly. But here we see that if you do it right, the Hessian eigenspectrum can give you a pretty good idea of how large λ is. Especially if there aren't that many zero eigenvalues.
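To make that concrete, here's a minimal sketch of how you could turn a computed Hessian into these bounds with plain linear algebra. This is my own illustration rather than anything from the post: the function name, the dense-matrix setup, and the `tol` cutoff for calling an eigenvalue "zero" are all assumptions for the sake of the example, and for a real network you'd presumably estimate the spectrum with an iterative method instead of materialising the full Hessian.

```python
import numpy as np

def learning_coefficient_bounds(hessian: np.ndarray, tol: float = 1e-8):
    """Bound the local learning coefficient via the Hessian rank.

    Returns (d1/2, d1/2 + (d - d1)/3), where d1 is the numerical rank of
    the (symmetric) Hessian, i.e. the number of eigenvalues with
    magnitude above `tol`.
    """
    d = hessian.shape[0]
    eigenvalues = np.linalg.eigvalsh(hessian)    # symmetric eigensolve
    d1 = int(np.sum(np.abs(eigenvalues) > tol))  # numerical rank
    return d1 / 2, d1 / 2 + (d - d1) / 3

# Toy check: K(w) = w1^2 + w2^4 has Hessian diag(2, 0) at the origin,
# so d = 2, d1 = 1, and the bounds are (0.5, 0.8333...).
print(learning_coefficient_bounds(np.diag([2.0, 0.0])))
```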
Intuitively, the lower bound works because a direction in the parameters w that isn't free to vary to second order in the Taylor expansion won't become any more free to vary if you pile on a bunch of higher-order terms. The second-order term strictly dominates the higher-order ones; they can't cancel it out.
Qualitatively speaking, the upper bound works for the same reason. The higher-order terms in the Taylor expansion of the loss can only matter so much. The Hessian is the leading term, so it can impact λ the most, adding $\frac{1}{2}$ to it per Hessian rank. The remaining $O(|w|^3)$ terms can only add up to $\frac{1}{3}$ each for the remaining $d - d_1$ directions.
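A toy example of my own (not from the post) to make this concrete: take the two-parameter loss $K(w_1, w_2) = w_1^2 + w_2^4$ with a smooth, positive prior that factorises over the coordinates. A standard computation (the partition function factorises over the two coordinates, and each factor $\int e^{-n w^{2k}}\varphi\, dw$ scales as $n^{-\frac{1}{2k}}$) gives
$$\lambda = \frac{1}{2} + \frac{1}{4} = \frac{3}{4}.$$
The Hessian at the origin is $\mathrm{diag}(2, 0)$, so $d = 2$ and $d_1 = 1$, and the bounds read $\frac{1}{2} \le \lambda \le \frac{1}{2} + \frac{1}{3} = \frac{5}{6}$: the quadratic direction contributes its full $\frac{1}{2}$, the quartic one contributes $\frac{1}{4} \le \frac{1}{3}$, and the upper bound holds without being tight.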
The proof for the upper bound will just be a small modification of the proof for theorem 7.2 on pages 220 and 221 of Algebraic Geometry and Statistical Learning Theory. Maybe read that first if you want more technical context.
Some words on notation
In the following, I'll mostly stick to the notation and conventions of the book Algebraic Geometry and Statistical Learning Theory. You can read about all the definitions there. I'm too lazy to reproduce them all.
To give some very rough context, K(w) is sort of like the 'loss' at parameter configuration w, φ(w) is our prior over parameters, and Z(n) is the partition function after updating on n data points.[5]
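One more piece of context the proof relies on (paraphrasing the book rather than quoting it, so treat the details as my gloss): when the set of true parameters is non-empty, the partition function decays polynomially, roughly
$$Z(n) \;\asymp\; n^{-\lambda}\, (\log n)^{m-1} \quad \text{as } n \to \infty,$$
where m is the multiplicity. So a lower bound of the form $Z(n) \ge c\, n^{-a}$ with $c > 0$ independent of n forces $\lambda \le a$; that is how the final inequality of the proof below turns into an upper bound on λ.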
Theorem:
Let $W \subseteq \mathbb{R}^d$ be the set of parameters of the model. If there exists an open set $U \subset W$ such that
$$\{w \in U : K(w) = 0,\ \varphi(w) > 0\}$$
is not an empty set, and we define $d_1 = \operatorname{rank}(H)$ as the rank of the Hessian $H$ at a point $w_0 \in U$,
$$H_{i,j} = \left.\frac{\partial^2 K(w)}{\partial w_i\, \partial w_j}\right|_{w = w_0},$$
with $w_i, w_j$ elements of some orthonormal basis $\{w_1, \dots, w_d\}$ of $\mathbb{R}^d$, then
$$\lambda \;\le\; \frac{d_1}{2} + \frac{d - d_1}{3}.$$
Proof:
We can assume $w_0 = 0$ without loss of generality. If $\epsilon_1, \epsilon_2$ are sufficiently small constants,
$$Z(n) = \int \exp(-nK(w))\,\varphi(w)\, dw \;\ge\; \int_{|w^{(1)}| \le \epsilon_1,\, |w^{(2)}| \le \epsilon_2} \exp(-nK(w))\,\varphi(w)\, dw.$$
Here, $w^{(1)} \in W/\ker(H)$ and $w^{(2)} \in \ker(H)$.
If we pick $\{w_1, \dots, w_d\}$ to be the Hessian eigenbasis, then for sufficiently small $|w| > 0$,
$$K(w) = \frac{1}{2}\sum_{i=1}^{d_1} H_{i,i}\, \big(w^{(1)}_i\big)^2 + O(|w|^3)\,.$$
Hence
$$Z(n) \;\ge\; \int_{|w^{(1)}| \le \epsilon_1,\, |w^{(2)}| \le \epsilon_2} \exp\left\{-\frac{n}{2}\sum_{i=1}^{d_1} H_{i,i}\, \big(w^{(1)}_i\big)^2 - n\, O(|w|^3)\right\} \varphi(w)\, dw.$$
Transforming $w'^{(1)} = n^{\frac{1}{2}}\, w^{(1)}$ and $w'^{(2)} = n^{\frac{1}{3}}\, w^{(2)}$, we obtain (for sufficiently large n)
$$Z(n) \;\ge\; n^{-\frac{d_1}{2}}\, n^{-\frac{d - d_1}{3}} \int_{|w'^{(1)}| \le 1,\, |w'^{(2)}| \le 1} \exp\left\{-\frac{1}{2}\sum_{i=1}^{d_1} H_{i,i}\, \big(w'^{(1)}_i\big)^2 + O(|w'|^3)\right\} \varphi\!\left(w'^{(1)} n^{-\frac{1}{2}},\, w'^{(2)} n^{-\frac{1}{3}}\right) dw'^{(1)}\, dw'^{(2)}.$$
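To spell out what this step buys us (my gloss on the standard argument): the prefactor is just the Jacobian of the change of variables, $dw^{(1)} = n^{-\frac{d_1}{2}}\, dw'^{(1)}$ and $dw^{(2)} = n^{-\frac{d - d_1}{3}}\, dw'^{(2)}$, and the cube-root scaling on the kernel directions is exactly what keeps the remainder from growing with n: since $|w| \le n^{-\frac{1}{3}} |w'|$ for $n \ge 1$,
$$n\, O(|w|^3) \;=\; n\, O\!\left(n^{-1} |w'|^3\right) \;=\; O(|w'|^3).$$
Scaling the kernel directions by $n^{-\frac{1}{2}}$ instead would also keep the remainder bounded, but would shrink the prefactor to $n^{-\frac{d}{2}}$ and so only recover the trivial $\lambda \le \frac{d}{2}$ bound, while any scaling slower than $n^{-\frac{1}{3}}$ would let the $n\, O(|w|^3)$ term blow up. $n^{-\frac{1}{3}}$ is the sweet spot.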
Rearranging gives
$$Z(n) \;\ge\; n^{-\left(\frac{d_1}{2} + \frac{d - d_1}{3}\right)} \int_{|w'| \le 1} \exp\left\{-\frac{1}{2}\sum_{i=1}^{d_1} H_{i,i}\, \big(w'^{(1)}_i\big)^2 + O(|w'|^3)\right\} \varphi\!\left(w'^{(1)} n^{-\frac{1}{2}},\, w'^{(2)} n^{-\frac{1}{3}}\right) dw'.$$