Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research, published by Evan Hubinger on August 8, 2023 on The AI Alignment Forum.
TL;DR: This document lays out the case for research on "model organisms of misalignment" - in vitro demonstrations of the kinds of failures that might pose existential threats - as a new and important pillar of alignment research.
If you're interested in working on this agenda with us at Anthropic, we're hiring! Please apply to the research scientist or research engineer position on the Anthropic website and mention that you're interested in working on model organisms of misalignment.
The Problem
We don't currently have ~any strong empirical evidence for the most concerning sources of existential risk, most notably stories around dishonest AI systems that actively trick or fool their training processes or human operators:
Deceptive inner misalignment (a la Hubinger et al. 2019): where a model obtains good performance on the training objective in order to be deployed in the real world, where it then pursues an alternative, misaligned objective.
Sycophantic reward hacking (a la Cotra 2022): where a model obtains good performance during training (where it is carefully monitored), but pursues undesirable reward hacks (like taking over the reward channel, aggressive power-seeking, etc.) during deployment or in domains where human monitoring is less careful or effective.
Though we do have good examples of reward hacking on pretty-obviously-flawed reward functions, it's unclear whether those kinds of failures happen in today's systems, which are trained with human feedback. The potential reward-hacking failures discovered so far with RLHF (e.g., possibly sycophancy) don't necessarily seem world-ending, or even that hard to fix.
A significant part of why we think we don't see empirical examples of these failure modes is that they require the AI to develop several scary tendencies and capabilities, such as situational awareness and deceptive reasoning, and to deploy them together. As a result, it seems useful to evaluate the likelihood of each scary tendency or capability separately, in order to understand how severe the risks are from different potential forms of misalignment.
The Plan
We need to develop examples of the above alignment failures, in order to:
Learn more about the possible failures, to understand how likely they are, what causes them to arise, and what techniques may mitigate the failures (discussed here).
Inform the current conversation about AI risk by providing the best evidence of misalignment risks, if any. We hope this will be helpful for labs, academia, civil society, and policymakers to make better decisions (discussed here). If misalignment issues end up being serious, then it will be critical to form a strong scientific consensus that these issues are real, for which examples of alignment failures are crucial.
Roadmap
Stories around AI takeover involve several subcomponents, and the goal of this agenda is to develop models that exhibit each of those subcomponents. Once we've demonstrated each subcomponent, we would string them together into an end-to-end demonstration of what could go wrong. We can then use that demonstration as a testbed for studying alignment techniques and/or to ring a fire alarm (depending on how serious and unfixable the failure is).
Possible subcomponents of AI takeover to demonstrate
Having/developing a misaligned goal: AI takeover stories involve the AI having or developing a misaligned goal, despite having an aligned-looking training objective (e.g., human feedback). The misaligned goal (eventually) manifests as a clear, undeniable conflict with human values. See the descriptions of "Deceptive inner misalignment" and "Sycophantic reward hacking" earlier for...