The Nonlinear Library: Alignment Forum

AF - Summing up "Scheming AIs" (Section 5) by Joe Carlsmith



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Summing up "Scheming AIs" (Section 5), published by Joe Carlsmith on December 9, 2023 on The AI Alignment Forum.
This is Section 5 of my report "Scheming AIs: Will AIs fake alignment during training in order to get power?". There's also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I'm hoping that it will provide much of the context necessary to understand individual sections of the report on their own.
Audio version of this section here, or search for "Joe Carlsmith Audio" on your podcast app.
Summing up
I've now reviewed the main arguments I've encountered for expecting SGD to select a schemer. What should we make of these arguments overall?
We've reviewed a wide variety of interrelated considerations, and it can be difficult to hold them all in mind at once. On the whole, though, I think a fairly large portion of the overall case for expecting schemers comes down to some version of the "counting argument." In particular, I think the counting argument also underlies many of the other, more specific arguments I've considered. Thus:
In the context of the "training-game-independent proxy goal" argument: the basic worry is that at some point (whether before situational awareness, or afterwards), SGD will land naturally on a (suitably ambitious) beyond-episode goal that incentivizes scheming. And one of the key reasons for expecting this is just that (especially if you're actively training for fairly long-term, ambitious goals) it seems like a very wide variety of goals that fall out of training could have this property.
In the context of the "nearest max-reward goal" argument: the basic worry is that because schemer-like goals are quite common in goal-space, some such goal will be quite "nearby" whatever not-yet-max-reward goal the model has at the point it gains situational awareness, and thus, that modifying the model into a schemer will be the easiest way for SGD to point the model's optimization in the highest-reward direction.
In the context of the "simplicity argument": the reason one expects schemers to be able to have simpler goals than non-schemers is that they have so many possible goals (or: pointers-to-goals) to choose from. (Though: I personally find this argument quite a bit less persuasive than the counting argument itself, partly because the simplicity benefits at stake seem to me quite small.)
That is, in all of these cases, schemers are being privileged as a hypothesis because a very wide variety of goals could in principle lead to scheming, thereby making it easier to (a) land on one of them naturally, (b) land "nearby" one of them, or (c) find one of them that is "simpler" than non-schemer goals that need to come from a more restricted space.
And in this sense, as I noted in section 4.2, the case for schemers mirrors one of the most basic arguments for expecting misalignment more generally - e.g., that alignment is a very narrow target to hit in goal-space. Except, here, we are specifically incorporating the selection we know we are going to do on the goals in question: namely, they need to be such as to cause models pursuing them to get high reward. And the most basic worry is just that: this isn't enough. Still, despite your best efforts in training, and almost regardless of your reward signal, almost all the models you might've selected will be getting high reward for instrumental reasons - and specifically, in order to get power.
I think this basic argument, in its various guises, is a serious source of concern.
If we grant that advanced models will be relevantly goal-directed and situationally-aware, that a wide variety of goals would indeed lead to scheming, and that schemers would perform close-to-optimally in training, then on what grounds, exactly, w...