The Nonlinear Library: Alignment Forum

AF - Varieties of fake alignment (Section 1.1 of "Scheming AIs") by Joe Carlsmith


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Varieties of fake alignment (Section 1.1 of "Scheming AIs"), published by Joe Carlsmith on November 21, 2023 on The AI Alignment Forum.
This is Section 1.1 of my report "Scheming AIs: Will AIs fake alignment during training in order to get power?". There's also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I'm hoping that it will provide much of the context necessary to understand individual sections of the report on their own.
Audio version of this section here.
Scheming and its significance
This section aims to disentangle different kinds of AI deception in the vicinity of scheming (section 1.1), to distinguish schemers from the other possible model classes I'll be discussing (section 1.2), and to explain why I think that scheming is a uniquely scary form of misalignment (section 1.3). It also discusses whether theoretical arguments about scheming are even useful (section 1.4), and it explains the concept of "slack" in training - a concept that comes up later in the report in various places (section 1.5).
A lot of this is about laying the groundwork for the rest of the report - but if you've read and understood the summary of section 1 above (section 0.2.1), and are eager for more object-level discussion of the likelihood of scheming, feel free to skip to section 2.
Varieties of fake alignment
AIs can generate all sorts of falsehoods for all sorts of reasons. Some of these aren't well-understood as "deceptive" - because, for example, the AI didn't know the relevant truth. Sometimes, though, the word "deception" seems apt. Consider, for example, Meta's CICERO system, trained to play the strategy game Diplomacy, promising England support in the North Sea, but then telling Germany "move to the North Sea, England thinks I'm supporting him." [1]

[Figure: Park et al. (2023), Figure 1, reprinted with permission.]
Let's call AIs that engage in any sort of deception "liars." Here I'm not interested in liars per se. Rather, I'm interested in AIs that lie about, or otherwise misrepresent, their alignment. And in particular: AIs pretending to be more aligned than they are. Let's call these "alignment fakers."
Alignment fakers
Alignment fakers are important because we want to know whether our AIs are aligned. So the fakers are obscuring facts we care about. Indeed, the possibility of alignment-faking is one of the key ways in which making advanced AIs safe is harder than making other technologies safe. Planes aren't trying to deceive you about when they will crash. (And they aren't smarter than you, either.)
Why might you expect alignment faking? The basic story may be familiar: instrumental convergence.[2] That is: like surviving, acquiring resources, and improving your abilities, deceiving others about your motives can help you achieve your goals - especially if your motives aren't what these "others" would want them to be.
In particular: AIs with problematic goals will often have instrumental incentives to seek power. But humans often control levers of power, and don't want to give this power to misaligned AIs. For example, an AI lab might not want a misaligned AI to interact with customers, to write security-critical pieces of code, or to influence certain key decisions.
Indeed, often, if humans detect that an AI is misaligned, they will do some combination of shutting it down and modifying it, both of which can prevent the AI from achieving its goals. So a misaligned AI that doesn't want to get shut down/modified generally won't want humans to detect its misalignment.
This is a core dynamic giving rise to the possibility of what Bostrom (2014) calls a "treacherous turn" - that is, AIs behaving well while weak, but dangerously when strong.[3] On this variant of a treacherous turn - what we might call the "strategic betrayal...