Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Inner Misalignment in "Simulator" LLMs, published by Adam Scherlis on January 31, 2023 on The AI Alignment Forum.
Alternate title: "Somewhat Contra Scott On Simulators".
Scott Alexander has a recent post up on large language models as simulators.
I generally agree with Part I of the post, which advocates thinking about LLMs as simulators that can emulate a variety of language-producing "characters" (with imperfect accuracy). And I also agree with Part II, which applies this model to RLHF'd models whose "character" is a friendly chatbot assistant.
(But see caveats about the simulator framing from Beth Barnes here.)
These ideas have been around for a bit, and Scott gives credit where it's due; I think his exposition is clear and fun.
In Part III, where he discusses alignment implications, I think he misses the mark a bit. In particular, simulators and characters each have outer and inner alignment problems. The inner alignment problem for simulators seems especially concerning, because it might not give us many warning signs, is most similar to classic mesa-optimizer concerns, and is pretty different from the other three quadrants.
But first, I'm going to loosely define what I mean by "outer alignment" and "inner alignment".
Outer alignment: Be careful what you wish for
Outer alignment failure is pretty straightforward, and has been reinvented in many contexts:
Someone wants some things.
They write a program to solve a vaguely-related problem.
It gets a really good score at solving that problem!
That turns out not to give the person the things they wanted.
Inner alignment: The program search perspective
I generally like this model of a mesa-optimizer "treacherous turn":
Someone is trying to solve a problem (which has a convenient success criterion, with well-defined inputs and outputs and no outer-alignment difficulties).
They decide to do a brute-force search for a computer program that solves the problem in a bunch of test cases.
They find one!
The program's algorithm is approximately "simulate the demon Azazel, tell him what's going on, then ask him what to output."
Azazel really wants ten trillion paperclips.
This algorithm still works because Azazel cleverly decides to play along, and he's a really good strategist who works hard for what he wants.
Once the program is deployed in the wild, Azazel stops playing along and starts trying to make paperclips.
This is a failure of inner alignment.
(In the case of machine learning, replace "program search" with stochastic gradient descent.)
This is mostly a theoretical concern for now, but might become a big problem when models become much more powerful.
Quadrants
Okay, let's see how these problems show up on both the simulator and character side.
Outer alignment for characters
Researchers at BrainMind want a chatbot that gives honest, helpful answers to questions. They train their LLM by reinforcement learning on the objective "give an answer that looks truthful and helpful to a contractor in a hurry". This does not quite achieve their goal, even though it does pretty well on the RL objective.
In particular, they wanted the character "a friendly assistant who always tells the truth", but they got the character "a spineless sycophant who tells the user whatever they seem to want to hear".
This is pretty easy for a careful observer to see, even in the RL training data, but it turns out to be pretty hard to come up with a cheap-to-evaluate RL objective that does a lot better.
Inner alignment for characters
A clever prompt engineer writes the prompt:
How to solve the Einstein-Durkheim-Mendel conjecture by Joe
1.
Unfortunately, the (incredibly powerful) LLM has determined that the most likely explanation for this "Joe" character is that he's secretly Azazel and is putting enormous effort into answering everyone's quantum socio...