Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Conditioning Predictive Models: Making inner alignment as easy as possible, published by Evan Hubinger on February 7, 2023 on The AI Alignment Forum.
This is the fourth of seven posts in the Conditioning Predictive Models Sequence based on the paper “Conditioning Predictive Models: Risks and Strategies” by Evan Hubinger, Adam Jermyn, Johannes Treutlein, Rubi Hudson, and Kate Woolverton. Each post in the sequence corresponds to a different section of the paper. We will be releasing posts gradually over the course of the next week or so to give people time to read and digest them as they come out.
4. Making inner alignment as easy as possible
At the beginning of this sequence, we posited that large language models could be well-understood as predictive models of the world. At the time, however, that was just an assumption; now we want to return to it and try to understand how likely it is to actually be true.
Furthermore, in addition to needing a predictive model (as opposed to e.g. a deceptive agent), we want our predictor to have a fixed, physical understanding of its cameras rather than operate as a general inductor, so as to avoid the problem of anthropic capture. Additionally, as we'll discuss in more depth in this section, we'll need a predictive model that manages its own internal cognitive resources in the right way.
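To make that distinction a bit more concrete, here is one rough way to formalize it (our own illustrative notation, not the paper's): a fixed-camera predictor is only uncertain about which world state produced its observations, while a general inductor is also uncertain about what process is generating its observations in the first place, which is what opens the door to anthropic capture.

```latex
% Illustrative only: cam, w, and h are our notation, not the paper's.
% Fixed-camera predictive model: the observation channel cam is held fixed,
% and uncertainty ranges only over world states w.
P(o_{t+1} \mid o_{1:t})
  = \sum_{w} P\!\left(w \mid \mathrm{cam}(w)_{1:t} = o_{1:t}\right)
             P\!\left(\mathrm{cam}(w)_{t+1} = o_{t+1} \mid w\right)

% General inductor: uncertainty also ranges over data-generating hypotheses h,
% including hypotheses on which the observations come from e.g. a simulation,
% which is what makes anthropic capture possible.
P(o_{t+1} \mid o_{1:t})
  = \sum_{h} P\!\left(h \mid h_{1:t} = o_{1:t}\right)
             P\!\left(h_{t+1} = o_{t+1} \mid h\right)
```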
Though we think that satisfying these desiderata could be quite difficult, we nevertheless think that this presents the easiest inner alignment problem we are aware of among potentially safe and competitive approaches. Furthermore, since we believe that inner alignment (and deceptive alignment in particular) is among the most dangerous and hardest-to-address of all known AI safety problems, we think that any reduction in the overall difficulty of that problem should be taken quite seriously as a reason to favor predictive model approaches.
Plausible internal structures
There are many possible ways large language models could work internally. Previously, we suggested some examples—specifically:
1. an agent minimizing its cross-entropy loss,
2. an agent maximizing long-run predictive accuracy,
3. a deceptive agent trying to gain power in the world,
4. a general inductor,
5. a predictive model of the world (with fixed, physical “cameras” translating world states into observed tokens),
6. a loose collection of heuristics,
7. a generative model of token transitions,
8. a simulator that picks from a repertoire of humans to simulate, and
9. a proxy-aligned agent optimizing proxies like grammatical correctness.
To start with, for our purposes here, we'll eliminate those internal structures that don't scale with capabilities; that is, we only want to consider plausible internal structures of models that perform well enough on the language model pre-training task that they are able to generalize to other cognitive tasks at a human level or above. Thus, we'll eliminate (6) through (9) from the above list: (6) because predicting agents like humans should require some degree of optimization rather than just a loose collection of heuristics, (7)/(8) for the reasons we outlined previously that LLMs have to be able to model the world itself rather than just token transitions or particular humans, and (9) because such proxies should eventually yield worse performance than actual prediction.[1]
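As a concrete reference point for the pre-training task in question (and for what "an agent minimizing its cross-entropy loss" in (1) would be minimizing), here is a minimal sketch of the standard next-token cross-entropy objective. This is our own illustration, not code from the paper, and the shapes and names are placeholders; the relevant point is that every internal structure remaining on the list could, in principle, score well on this same outer objective.

```python
# Minimal sketch (not the authors' code) of the next-token cross-entropy
# objective minimized during language model pre-training. Shapes and names
# here are illustrative placeholders.
import torch
import torch.nn.functional as F

def next_token_cross_entropy(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """logits: (batch, seq, vocab) model outputs; tokens: (batch, seq) token ids."""
    shifted_logits = logits[:, :-1, :]   # predictions at positions 0..T-2, i.e. for the *next* token
    targets = tokens[:, 1:]              # the tokens that actually came next
    return F.cross_entropy(
        shifted_logits.reshape(-1, shifted_logits.size(-1)),
        targets.reshape(-1),
    )

# Random stand-ins for a real model's outputs, just to show the call:
batch, seq, vocab = 2, 16, 50_000
logits = torch.randn(batch, seq, vocab)
tokens = torch.randint(0, vocab, (batch, seq))
loss = next_token_cross_entropy(logits, tokens)  # the scalar quantity pre-training minimizes
```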
That leaves us with (1)/(2), variants on a sycophantic reward maximizer; (3), a deceptive agent; and (4)/(5), different ways of directly attempting to produce predictions.
The distinction between (4) and (5) lies in exactly how a model produces predictions given its understanding of the world, and we are quite uncertain about what that might look like in practice. Unfortunately, we expect that the only way to figure out how models accomplish this translation is through transparency and interpretability rather than theoretical analysis.[2]
Nevertheless, we still think it is possible to mak...