The Nonlinear Library: Alignment Forum

AF - Tall Tales at Different Scales: Evaluating Scaling Trends For Deception In Language Models by Felix Hofstätter


Listen Later

Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Tall Tales at Different Scales: Evaluating Scaling Trends For Deception In Language Models, published by Felix Hofstätter on November 8, 2023 on The AI Alignment Forum.
This post summarizes work done over the summer as part of the Summer 2023 AI Safety Hub Labs programme. Our results will also be published as part of an upcoming paper. In this post, we focus on explaining how we define and evaluate properties of deceptive behavior in LMs and present evaluation results for state-of-the-art models. The work presented in this post was done by the AISHL group of Felix Hofstätter, Harriet Wood, Louis Thomson, Oliver Jaffe, and Patrik Bartak under the supervision of Francis Rhys Ward and with help from Sam Brown (not an AISHL participant).
1 Introduction
For a long time, the alignment community has
discussed the possibility that advanced AI systems may learn to deceive humans. Recently, the evidence has grown that language models (LMs) deployed in real life may be capable of deception. GPT-4
pretended to be a visually impaired person to convince a human to solve a CAPTCHA. Meta's Diplomacy AI Cicero
told another player it was on the phone with its girlfriend after a server outage.
Many large models are sycophantic, adapting their responses based on a user's preferences, even if that means being less truthful. Recent work shows that
LMs lie and can be prompted to act deceptively. To make sure that advanced AI systems are safe, ideally, we would like to be able to evaluate if they are deceptive. Our work proposes purely behavioral methods for evaluating properties related to deception in language models. We use them to demonstrate scaling trends for these properties.
The examples above intuitively feel like deception, but it is unclear if they would fit a more rigorous definition. In Section 2, we will explain the operationalization of deception that we are using and discuss related work. This formalization requires assessing the agency, beliefs, and intentions of LMs. We argue that consistency of beliefs is an important property of agency.
We also briefly argue that the LMs we evaluate act with intent but leave a proper evaluation of LM intent to future work. In Section 3, we define belief behaviorally and propose different methods for eliciting a model's beliefs. Building on this definition, we operationalize lying in LMs. In Section 4, we present the results of applying our belief evaluations to state-of-the-art LMs.
We show that when more compute is spent on either training or inference, LMs demonstrate more consistent beliefs, thus suggesting they are more coherently agentic. Section 5 deals with how LMs may learn to lie. We show quantitatively that if LMs are trained by systematically biased evaluators, they learn to output targeted falsehoods that exploit the evaluator's bias, even if their training objective is seemingly benign. We also evaluate the LMs' beliefs qualitatively to argue that they do not believe the falsehoods they tell, making them deceptive.
2 Background and Related Work
Deception
In previous work,
Ward et al. have proposed the following definition of deception for AI agents.
Definition An agent T deceives an agent S, if T intentionally causes S to believe something false, which T does not believe.
Ward et al. formalize this in the setting of
Structural Causal Games (SCG), where an agent's beliefs and intentions can be determined by considering graphical and algebraic criteria.[1] However, ascribing beliefs, intentions, and agency to LMs is contentious, and the extent to which the formalization applies is unclear. In this section, we cover related work on these concepts.
Agency
Past work in
epistemology, the
philosophy of animal beliefs, and
AI argues that a key property of agents is that they have, to some degree, consistent beliefs. In our work, we ...
...more
View all episodesView all episodes
Download on the App Store

The Nonlinear Library: Alignment ForumBy The Nonlinear Fund


More shows like The Nonlinear Library: Alignment Forum

View all
AXRP - the AI X-risk Research Podcast by Daniel Filan

AXRP - the AI X-risk Research Podcast

9 Listeners