Implied "utilities" of simulators are broad, dense, and shallow, by porby, published March 1, 2023 on The AI Alignment Forum.
This is a quick attempt at deconfusion along the same lines as the earlier post on instrumentality: same ideas, different angle.
Extremely broad, dense reward functions constrain training-compatible goal sets
Predictors/simulators are typically trained against a ground truth for every output. There is no gap between the output and its evaluation; an episode need not be completed before figuring out how good the first token prediction was. These immediate evaluations for every training sample can be thought of as a broad and densely defined reward function.
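As a concrete illustration of that density, here is a minimal numpy sketch (not from the original post; the vocabulary, sequence, and model outputs are invented) of how every position in a predicted sequence gets its own immediately computable loss against the ground truth:

```python
import numpy as np

# Toy illustration: a "prediction" is just a probability distribution over a
# small vocabulary at each position in a sequence. All values are made up.
vocab_size = 5
sequence = [2, 0, 4, 1]                      # hypothetical ground-truth token ids
rng = np.random.default_rng(0)

# Pretend model outputs: one distribution per position.
logits = rng.normal(size=(len(sequence), vocab_size))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Every position is scored immediately and exactly against the ground truth;
# nothing has to wait for the rest of the "episode" to be completed.
per_token_loss = -np.log(probs[np.arange(len(sequence)), sequence])
print(per_token_loss)                        # one loss value per predicted token
```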
It's easier for a model to fall into an undesired training-compatible goal set when there are many accessible undesirable goal sets relative to desirable ones. As the number of constraints imposed by the reward function used in training increases, the number of training-compatible goal sets tends to decrease, and those that survive obey more of the desirable constraints.
There is no guarantee that SGD will find an agent which could be modeled by a utility function that maps perfectly onto the defined reward function, but if you throw trillions of constraints at the function, and simultaneously give it lots of highly informative hints about what path to walk, you should expect the potential output space to be far narrower than if you hadn't.
Impact on internal mesaoptimizers
The dense loss/reward function does not constrain out-of-distribution behavior as heavily. In principle, a strong misaligned mesaoptimizer within a predictive model could persist in these degrees of freedom by providing extremely good solutions on in-distribution samples while doing arbitrarily misaligned things out of distribution.
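A toy sketch of that degree-of-freedom argument (purely illustrative; the "models" below are arbitrary functions, not claims about how real mesaoptimizers work): two functions can be indistinguishable under a loss that is only ever computed on in-distribution samples while behaving completely differently elsewhere.

```python
import numpy as np

train_inputs = np.linspace(0.0, 1.0, 100)    # stand-in for "in distribution"

def model_a(x):
    return np.sin(x)

def model_b(x):
    # Identical on [0, 1], arbitrary elsewhere.
    return np.where((x >= 0.0) & (x <= 1.0), np.sin(x), -1000.0)

# The training loss sees only train_inputs, so it cannot tell these apart.
print(np.allclose(model_a(train_inputs), model_b(train_inputs)))  # True
print(model_a(2.0), model_b(2.0))            # diverge off-distribution
```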
But how would that type of mesaoptimizer develop in the first place?
Steps toward it must serve the training objective; those constraints still shape the mesaoptimizer's training even if its most notable activity ends up being hidden.
The best story I've found so far goes something like this:
Traditional reinforcement learning agents are mostly unconstrained; the reward function is sparse relative to the state and action space.
An agent faced with sparse rewards must learn actions that serve a later goal in order to get any reward at all (a toy version of this contrast with dense supervision is sketched in the code after this story).
Not surprisingly, agents facing sparse reward relative to state/action space and few constraints have a much larger percentage of undesirable training-compatible goal sets.
Mesaoptimizers are processes learned within a model, and their local training influences may not perfectly match the outer training influences.
If the mesaoptimizer's local training influences look more like the traditional reinforcement learning agent's influences than the predictor's outer influences, it would be more likely to fall into one of the undesirable training-compatible goal sets.
The mesaoptimizer learns incorrect goals and a high propensity for goal-serving intermediate actions ("actions" within the scope of a single model execution!).
The mesaoptimizer is kept around by SGD because it does well on the subset of outputs the outer model uses it for. As capability grows, the mesaoptimizer strategically takes over other chunks of prediction space by performing well during training in an effort to be selected for out-of-distribution predictions.
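To make the supervision-density contrast at the start of this story concrete, here is a toy comparison (all numbers hypothetical, not from the original post) between the per-output feedback a predictor receives and the single end-of-episode scalar a sparse-reward agent receives:

```python
import numpy as np

# Toy contrast over an "episode" of the same length; numbers are invented.
episode_length = 8

# Predictor: a ground-truth evaluation at every step.
predictor_feedback = np.ones(episode_length)   # one loss term per output

# Sparse-reward RL: nothing until the end, then a single scalar.
rl_feedback = np.zeros(episode_length)
rl_feedback[-1] = 1.0                          # e.g. win/lose at episode end

# Intermediate actions only matter through their effect on that final scalar,
# so the agent must learn actions that serve a later goal to get any reward.
print(predictor_feedback)  # [1. 1. 1. 1. 1. 1. 1. 1.]
print(rl_feedback)         # [0. 0. 0. 0. 0. 0. 0. 1.]
```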
In a previous post, I called the learned propensity for goal-serving intermediate action instrumentality. The constraints imposed by predictive model training clearly confer lower instrumentality than traditional RL in all current models. I suspect the path taken by the mesaoptimizer above is hard and unnatural, but perhaps not impossible for some form of predictor taken to the relevant extreme.
It seems critical to understand the degree to which outer constraints apply...