Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Thinking about maximization and corrigibility, published by James Payor on April 21, 2023 on LessWrong.
Thanks in no small part to Goodhart's curse, there are broad issues with getting safe/aligned output from AI designed like "we've given you some function f(x), now work on maximizing it as best you can".
Part of the failure mode is that when you optimize for high-scoring x, you risk finding candidates that break your model of why a high-scoring candidate is good, and drift away from things you value. And I wonder if we can repair this by having the AI steer away from values of x that break our models, taking care not to disrupt structure/causal-relationships/etc. we might be relying on.
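As a toy illustration of this failure mode (a sketch of my own, not from the post): suppose the proxy f is the true value plus independent noise. A hard argmax over the proxy then systematically selects candidates where the noise happens to be large and positive, so the correlation we relied on is weakest exactly at the optimum. The names true_value and proxy_score below are hypothetical stand-ins.

```python
import random

random.seed(0)

def true_value(x):
    # What we actually care about (hypothetical stand-in).
    return x["substance"]

def proxy_score(x):
    # Our measurable f(x): correlated with true value, but imperfect.
    return x["substance"] + x["noise"]

# Candidate solutions: each has some real substance plus measurement noise.
candidates = [
    {"substance": random.gauss(0, 1), "noise": random.gauss(0, 1)}
    for _ in range(100_000)
]

# Hard maximization over the proxy...
best_by_proxy = max(candidates, key=proxy_score)

# ...selects heavily for candidates where the proxy overestimates true value:
print("proxy score of winner:", round(proxy_score(best_by_proxy), 2))
print("true value of winner: ", round(true_value(best_by_proxy), 2))
# With this many candidates, a large share of the winner's apparent score is
# noise: the link between "scores highly" and "is good" breaks at the optimum.
```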
Here's what I'd like to discuss in this post:
When unstructured maximization does/doesn't work out for the humans
CIRL and other schemes mostly pass the buck on optimization power, so they inherit the incorrigibility of their inner optimization scheme
It's not enough to sweep the maximization under a rug; what we really need is more structured/corrigible optimization than "maximize this proxy"
Maybe we can get some traction on corrigible AI by detecting and avoiding internal Goodhart
When does maximization work?
In cases when it just works to maximize, there will be a structural reason that our model connecting "x scores highly" to "x is good" didn't break down. Some of the usual reasons are:
Our metric is robustly connected to our desired outcome. If the model connecting the metric and good things is simple, there's less room for it to be broken. Examples: theorem proving, compression / minimizing reconstruction error.
The space we're optimizing over is not open-ended. Constrained spaces leave less room for weird choices of x to break the correspondences we were relying on. Examples: chess moves, paths in a graph, choosing from vetted options, rejecting options that fail sanity/legibility checks.
The optimization power being applied is limited. We can know our optimization probably won't invent some x that breaks our model if we know what kinds of search it is performing, and can see that these reliably don't seek things that could break our model. Examples: quantilization (see the sketch after this list), GPT-4 tasked to write good documentation.
The metric f is actively optimized to be robust against the search. We can sometimes offload some of the work of keeping our assessment f in tune with goodness. Examples: chess engine evaluations, having f evaluate the thoughts that lead to x.
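Since quantilization appears in the third item above as a way of limiting optimization power, here is a minimal sketch of the idea (again my own illustration, reusing the hypothetical candidates and proxy_score from the earlier snippet): rather than taking the argmax over the proxy, sample uniformly from the top q fraction of candidates, so the selection mostly still looks like the base distribution and only pushes on the proxy so hard.

```python
import random

def quantilize(candidates, score, q=0.05, rng=random):
    """Pick uniformly at random from the top q fraction of candidates by score.

    Unlike argmax, this caps how hard we optimize the proxy `score`, so most
    selected candidates remain typical of the base distribution we drew from.
    """
    ranked = sorted(candidates, key=score, reverse=True)
    top_k = max(1, int(len(ranked) * q))
    return rng.choice(ranked[:top_k])

# Usage with the toy candidates/proxy from before (hypothetical names):
# pick = quantilize(candidates, proxy_score, q=0.01)
```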
There's a lot to go into about when and whether these reasons start breaking down, and what happens then. I'm leaving that outside the scope of this post.
Passing the buck on optimization
Merely passing the buck on optimization, pushing the maximization elsewhere without adding much structure, isn't a satisfactory solution for getting good outcomes out of strong optimizers.
Take CIRL (Cooperative Inverse Reinforcement Learning) for instance, or perhaps more broadly the paradigm: "the AI maximizes an uncertain utility function, which it learns about from earmarked human actions". This design has something going for it in terms of corrigibility! When a human tries to turn it off, there's scope for the AI to update about which sort of thing to maximize, which can lead to it helping you turn itself off.
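To make that off-switch dynamic concrete, here's a toy calculation in the spirit of the off-switch game (illustrative numbers and hypothetical function names of my own, not the post's or CIRL's actual formalism): the AI holds a posterior over how good its current plan is, treats the button press as evidence the plan is bad, and accepts shutdown only if the posterior expected utility of proceeding no longer beats shutting down.

```python
# The AI is uncertain about the utility of its current plan; a human pressing
# the off switch is evidence that the plan is bad. Whether the AI accepts
# shutdown depends on its posterior after that evidence.

def expected_utility_if_proceed(posterior):
    # posterior: list of (probability, utility-of-plan) pairs after the update
    return sum(p * u for p, u in posterior)

UTILITY_OF_SHUTDOWN = 0.0

def accepts_shutdown(posterior):
    return expected_utility_if_proceed(posterior) <= UTILITY_OF_SHUTDOWN

# A genuinely uncertain AI: the button press shifts a lot of weight onto
# "the plan is bad", and shutting down comes out ahead.
uncertain_posterior = [(0.4, +1.0), (0.6, -2.0)]
print(accepts_shutdown(uncertain_posterior))   # True

# An AI that thinks it already knows the broad strokes of the utility
# function: the button press barely moves it, and proceeding still "wins".
confident_posterior = [(0.95, +1.0), (0.05, -2.0)]
print(accepts_shutdown(confident_posterior))   # False
```

The second case is where the trouble starts: once the maximizer is confident about the utility function, the button press carries little information for it.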
But this is still not the sort of objective you want to point maximization at. There are a variety of scenarios in which there are "higher-utility" plans than accepting shutdown:
If the AI thinks it already knows the broad strokes of the utility function, it can calculate that utility would not be maximized by shutting off. It's learning something from you trying to press the off switch, but not what you wanted.
It might seem better to stay online and watch longer in order to learn more about the utility function.
Maybe there's a plan that rates highly on "utility" that works ...