Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Utility uncertainty vs. expected information gain, published by michaelcohen on March 9, 2023 on The AI Alignment Forum.
It is a relatively intuitive thought that if a Bayesian agent is uncertain about its utility function, it will act more conservatively until it has a better handle on what its true utility function is.
This might be deeply flawed in a way that I'm not aware of, but I'm going to point out a way in which I think this intuition is slightly flawed. For a Bayesian agent, a natural measure of uncertainty is the entropy of its distribution over utility functions (the distribution over which possible utility function it thinks is the true one). No matter how uncertain a Bayesian agent is about which utility function is the true one, if the agent does not believe that any future observations will cause it to update its belief distribution, then it will just act as if it has a utility function equal to the Bayes' mixture over all the utility functions it considers plausible (weighted by its credence in each one).
It seems like what our intuition is grasping for is not uncertainty about the utility function, but expected information gain about the utility function. If the agent expects to gain information about the utility function, then (intuitively to me, at least) it will act more conservatively until it has a better handle on what its true utility function is.
Expected information gain (at time t) is naturally formalized as the expectation (w.r.t. current beliefs) of KL(posterior distribution at time t + m posterior distribution at time t). Roughly, this is how poorly it expects its current beliefs will approximate its future beliefs (in m timesteps).
So if anyone has a safety idea to which utility uncertainty feels central, my guess is that a mental substitution from uncertainty to expected information gain would be helpful.
Unfortunately, on-policy expected information gain goes to 0 pretty fast (Theorem 5 here).
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.