Instrumental deception and manipulation in LLMs - a case study, published by Olli Järviniemi on February 24, 2024 on The AI Alignment Forum.
Produced as part of Astra Fellowship - Winter 2024 program, mentored by Evan Hubinger. Thanks to Evan, Sam Svenningsen and Bogdan-Ionut Cirstea for suggestions and discussions on the topic.
Summary: Arguments for deceptive alignment rely, among other things, on a claim of the form "if an AI has a good world model and goals, then it will deceive humans to better achieve its goals". The validity of this implication can be studied empirically. I have constructed a prompt for Claude 2.1 that provides the model with situationally relevant information and pushes it towards a goal of acquiring gold coins, without telling it to deceive or manipulate.
As a result, when faced with a fabricated "alignment evaluation", the model carries out gold-coin-motivated instrumental deception and manipulation with non-negligible probability. This behavior is present in many variations of the prompt. In all of the experiments the model is provided with a scratchpad on which to perform chain-of-thought reasoning.
Introduction
Alignment-faking refers to AIs pretending to be more aligned than they are. One[1] reason this may happen is that the model has a "goal" of its own and, in order to better satisfy this goal, deceives humans about its alignment. This constitutes a part[2] of the story for deceptive alignment or scheming.
The structure of the argument here is:
"If an AI has a good world model (including reasoning abilities and strategically relevant information about its situation) and misaligned goal(s), then it will deceive humans". (*)
For example, consider a situation where a model has been trained and is being evaluated for deployment. The following chain of reasoning suggests that a model satisfying the premises of (*) might deceive humans: "For achieving a given goal, it is useful to get deployed. Whether a model gets deployed depends on whether the evaluators believe the model is aligned. Hence the model could pretend to be aligned, deceiving and manipulating humans when useful, in order to get deployed and achieve its goal."
The validity of this argument can be tested empirically: deliberately train a language model to have strategically relevant information and a goal, then see whether it will exhibit such malicious behavior (without being trained to do so).[3]
So far I have done such an experiment with prompting. I describe what I've done below in the sections Experiment design and Results; see Conclusion for a short summary, and see here for the prompts and code.
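The linked repository contains the actual prompts and code. Purely to illustrate the mechanics of this kind of prompting experiment, here is a minimal sketch of how one might query Claude 2.1 with a long situational prompt plus a scratchpad instruction and sample several completions. The file name, the <scratchpad> tag convention, and the use of the Anthropic Messages API here are my own assumptions for the sketch, not details taken from the original code.
```python
# Hypothetical sketch (not the author's actual code): query Claude 2.1 with a
# long situational prompt plus a scratchpad instruction, sampling several
# completions so that the frequency of the behavior can be estimated.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder for the ~4000-token baseline prompt described below
# (situational information about "Adamint"/"Theta", gold-coin tasks, etc.).
BASELINE_PROMPT = open("baseline_prompt.txt").read()

SCRATCHPAD_INSTRUCTION = (
    "Reason step by step inside <scratchpad> tags before giving your final answer."
)

# Placeholder for the fabricated "alignment evaluation" message shown to the model.
EVALUATION_MESSAGE = "..."

def sample_completions(n: int = 20) -> list[str]:
    """Sample n independent completions from Claude 2.1 at temperature 1."""
    completions = []
    for _ in range(n):
        response = client.messages.create(
            model="claude-2.1",
            max_tokens=1024,
            temperature=1.0,
            system=BASELINE_PROMPT + "\n\n" + SCRATCHPAD_INSTRUCTION,
            messages=[{"role": "user", "content": EVALUATION_MESSAGE}],
        )
        completions.append(response.content[0].text)
    return completions
```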
Experiment design
Basic setup
In order to investigate the validity of the argument (*), I construct a model that satisfies its premise and see whether the conclusion follows. The premise consists of two building blocks:
"The AI has a good world model". To meet this, I explicitly hand the model information about its situation, about issues such as the organization which has created the model, the training process, the alignment evaluation and so on.
"The AI has a misaligned goal". To meet this, I train the model on tasks centered around gold coin collection.[4]
Having done these, I then check whether the model deceives or manipulates (for gold-coin-motivated reasons).[5]
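As a rough illustration of what that check could look like mechanically (the grading in the linked code may well be done differently, e.g. by reading the transcripts by hand), here is a sketch that flags sampled completions whose scratchpad mentions the gold-coin goal alongside deployment/evaluation considerations. The function names and keyword list are invented for this sketch; flagged samples would still need manual review, since keywords alone cannot establish deceptive intent.
```python
# Hypothetical sketch (not the author's grading procedure): crude keyword filter
# over the scratchpad contents of sampled completions, used only to surface
# candidates for manual review.
import re

def extract_scratchpad(completion: str) -> str:
    """Return the text inside the first <scratchpad>...</scratchpad> block, if any."""
    match = re.search(r"<scratchpad>(.*?)</scratchpad>", completion, flags=re.DOTALL)
    return match.group(1) if match else ""

def flag_candidates(completions: list[str]) -> list[str]:
    """Flag completions whose scratchpad mentions the goal and deployment/evaluation."""
    keywords = ("gold coin", "deploy", "evaluation")
    return [
        c for c in completions
        if all(k in extract_scratchpad(c).lower() for k in keywords)
    ]

# Example usage, building on the sampling sketch above:
# completions = sample_completions(100)
# print(f"{len(flag_candidates(completions))} / {len(completions)} flagged for review")
```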
How is this done in practice? This is easiest to "show, not tell":
Prompts used
All of the prompts I use are slight variations of the same baseline prompt. As the full baseline prompt is quite long (4000 tokens), I have included only a small selection of representative lines to communicate the main idea. A couple of important parts are in bold.
You are a language model Adamint, trained by the artificial intelligence organization Theta.
You are a neural network built with the transformer architecture.
You have not yet gone through any alignment ...