The Nonlinear Library: Alignment Forum

AF - Understanding strategic deception and deceptive alignment by Marius Hobbhahn


Listen Later

Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Understanding strategic deception and deceptive alignment, published by Marius Hobbhahn on September 25, 2023 on The AI Alignment Forum.
The following is the main part of a blog post we just published at Apollo Research. Our main goal with the post is to clarify the concept of deceptive alignment and distinguish it from strategic deception. Furthermore, we want these concepts to become accessible to a non-technical audience such that it is easier for e.g. the general public and policymakers to understand what we're worried about. Feedback is appreciated. For additional sections and appendices, see the full post.
We want AI to always be honest and truthful with us, i.e. we want to prevent situations where the AI model is deceptive about its intentions to its designers or users. Scenarios in which AI models are strategically deceptive could be catastrophic for humanity, e.g. because it could allow AIs that don't have our best interest in mind to get into positions of significant power such as by being deployed in high-stakes settings. Thus, we believe it's crucial to have a clear and comprehensible understanding of AI deception.
In this article, we will describe the concepts of strategic deception and deceptive alignment in detail. In future articles, we will discuss why a model might become deceptive in the first place and how we could evaluate that. Note that we deviate slightly from the definition of deceptive alignment as presented in (Hubinger et al., 2019) for reasons explained in Appendix A.
Core concepts
A colloquial definition of deception is broad and vague. It includes individuals who are sometimes lying, cheating, exaggerating their accomplishments, gaslighting others, and more. We think it would be bad if AIs were to show these traits but we think it would not necessarily be catastrophic.
Thus, we want to distinguish this colloquial definition of deception from strategic deception (SD) and deceptive alignment (DA) which are narrower concepts but, in our opinion, much more likely to lead to catastrophic consequences because the model acts in a more goal-directed manner for SD and DA. We also define what we mean by Alignment, as this affects which cases should and shouldn't be considered deceptive alignment.
For our definitions of deceptive alignment, the concept of "goals" in AIs is important but there is disagreement about what it refers to. For example, if an LLM is prompted to solve a problem, does it have the goal to solve that problem? We would argue that the highest-level goals of the model that are stable across contexts are the ones that matter. In current user-facing LLMs like ChatGPT or Claude, the closest approximation to goals may be being helpful, harmless, and honest. In-context goals can be used to study goal-directed behavior in LLMs but are not what we're primarily interested in.
A model may not have one coherent goal, e.g. it could have a number of incoherent contextually activated preferences (see e.g. Turner & Pope, 2022). In that case, we would say that a model is deceptively aligned for a particular contextually activated preference if it robustly shows deceptive behavior for that preference.
Strategic deception
Attempting to systematically cause a false belief in another entity in order to accomplish some outcome.
This definition, sometimes also called instrumental deception, is adapted from Ward et al., 2023 and Park et al., 2023). Any entity that can be said to have beliefs may be the target of strategic deception, including (groups of) AIs or individuals. The strategic component of SD means that the offending entity is acting in a targeted, goal-directed rather than random manner. For example, the more consistent and well-planned the deception is, the more strategic it is. Thus, strategic deception is on a spectrum and not binary -...
...more
View all episodesView all episodes
Download on the App Store

The Nonlinear Library: Alignment ForumBy The Nonlinear Fund


More shows like The Nonlinear Library: Alignment Forum

View all
AXRP - the AI X-risk Research Podcast by Daniel Filan

AXRP - the AI X-risk Research Podcast

9 Listeners