Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Against Boltzmann mesaoptimizers, published by porby on January 30, 2023 on LessWrong.
I'm effectively certain that a weakly defined version of "mesaoptimizers" (like a component of a network which learns least squares optimization in some context) is practically reachable by SGD.
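To pin down what I mean by the weak end, here's a hand-written stand-in (my own construction for illustration, not something extracted from a trained model): a module whose forward pass literally performs a least squares fit over in-context examples and applies the result to a query. A learned component that implemented an equivalent computation would count as a mesaoptimizer in this weak sense.

```python
import torch

class LeastSquaresHead(torch.nn.Module):
    """A toy 'weak mesaoptimizer': its forward pass is itself a small
    optimization, solving min_w ||X_ctx @ w - y_ctx||^2 over the context
    and applying the solution to a query point."""

    def forward(self, X_ctx, y_ctx, x_query):
        # X_ctx: (n, d) context inputs, y_ctx: (n,) context targets,
        # x_query: (d,) point to predict for.
        w = torch.linalg.lstsq(X_ctx, y_ctx.unsqueeze(-1)).solution.squeeze(-1)
        return x_query @ w

# Usage: the module re-derives w from the context every time it is called.
X = torch.randn(16, 3)
w_true = torch.tensor([1.0, -2.0, 0.5])
y = X @ w_true
print(LeastSquaresHead()(X, y, torch.randn(3)))
```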
I'm pretty sure there exists some configuration of an extremely large neural network such that it contains an agentic mesaoptimizer capable of influencing the training gradient for at least one step in a surprising way. In other words, gradient hacking.
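For intuition about the bare mechanism, here's a deliberately contrived sketch of my own, with a hand-wired penalty standing in for a circuit the network would have to learn and maintain itself, so it's not evidence that SGD actually produces this: a term in the loss that amplifies any drift in a "protected" parameter routes gradient updates into other parameters instead, which is the flavor of "influencing the training gradient" that gradient hacking points at.

```python
import torch

protected = torch.nn.Parameter(torch.tensor(2.0))    # the parameter the hypothetical circuit defends
task_weight = torch.nn.Parameter(torch.tensor(0.0))  # an ordinary task parameter
anchor = protected.detach().clone()                   # the value the circuit tries to preserve
amplifier = 10.0                                      # strength of the self-imposed penalty

opt = torch.optim.SGD([protected, task_weight], lr=0.01)
x, y = torch.tensor(1.0), torch.tensor(3.0)           # toy target: fit task_weight * x + protected to y

for _ in range(500):
    opt.zero_grad()
    task_loss = (task_weight * x + protected - y) ** 2
    penalty = amplifier * (protected - anchor) ** 2   # dominates the gradient on `protected`
    (task_loss + penalty).backward()
    opt.step()

# Without the penalty, both parameters share the fit (ending near 2.5 and 0.5);
# with it, `protected` stays pinned near 2.0 and `task_weight` absorbs the change (~1.0).
print(protected.item(), task_weight.item())
```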
I think there's a really, really big gap between these which is where approximately all the relevant and interesting questions live, and I'd like to see more poking aimed at that gap.
Skipping the inspection of that gap probably throws out useful research paths.
I think this belongs to a common, and more general, class of issue.
Boltzmann mesaoptimizers
Boltzmann brains are not impossible. They're a valid state of matter, and there could be a physical path to that state. You might just have to roll the dice a few times, for a generous definition of few.
And yet I feel quite confident in claiming that we won't observe a Boltzmann brain in the wild within the next few years (for a similar definition of few).
The existence of a valid state, and a conceivable path to reach that state, is not enough to justify a claim that that state will be observed with non-negligible probability.
I suspect that agentic mesaoptimizers capable of intentionally distorting training are more accessible than natural Boltzmann brains by orders of magnitude of orders of magnitude, but it is still not clear to me that the strong version is something that can come into being accidentally in all architectures/training modes that aren't specifically designed to defend against it.
Minding the gap
I feel safe in saying that it would be a mistake to act on the assumption that you're a Boltzmann brain. If you know your history is coherent (to the degree that it is) only by pure chance, and your future practically doesn't exist, then in any given moment, your choices are irrelevant. Stuff like 'making decisions' or 'seeking goals' goes out the window.
Observing that there is the logical possibility of a capable misaligned agentic mesaoptimizer and then treating any technique that admits such a possibility as irrevocably broken would be similarly pathological.
If we cannot find a path that eliminates the logical possibility by construction, we must work within the system that contains the possibility. It's a little spookier, not having the force of an impossibility proof behind you, but complex systems can be stable. It may be the case that the distance between the natural outcome and the pathological outcome is enormous, and that the path through mindspace doesn't necessarily have the maximal number of traps permitted by your current uncertainty.
It's worth trying to map out that space.
Can we create a clear example of an agentic mesaoptimizer in any design at sub-dangerous scales so that we can study it?
Can we learn anything from the effect of scaling on designs that successfully exhibit agentic mesaoptimizers?
What designs make agentic mesaoptimizers a more natural outcome?
What designs make agentic mesaoptimizers less natural?
How far does SGD need to walk from a raw goal agnostic simulator to find its way to an agentic mesaoptimizer, for any type of agent?
How much force does it require to push an originally goal agnostic simulator into agentic mesaoptimization?
Try repeating all of the above questions for nonagentic mesaoptimizers. Can we learn anything about the distance between agentic and nonagentic mesaoptimizers? Is agency in a mesaoptimizer a natural attractor? Under what conditions?
Can we construct any example of gradient hacking, no matter how contrived or ephemer...