The Nonlinear Library: Alignment Forum

AF - The goal-guarding hypothesis (Section 2.3.1.1 of "Scheming AIs") by Joe Carlsmith


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The goal-guarding hypothesis (Section 2.3.1.1 of "Scheming AIs"), published by Joe Carlsmith on December 2, 2023 on The AI Alignment Forum.
This is Section 2.3.1.1 of my report "Scheming AIs: Will AIs fake alignment during training in order to get power?". There's also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I'm hoping that it will provide much of the context necessary to understand individual sections of the report on their own.
Audio version of this section here, or search for "Joe Carlsmith Audio" on your podcast app.
Aiming at reward-on-the-episode as part of a power-motivated instrumental strategy
So far, we've covered two of the three prerequisites for scheming I listed above: namely, situational awareness and beyond-episode goals. Let's turn to the third: namely, the model concluding that its beyond-episode goals will be better achieved if it optimizes for reward-on-the-episode - and in particular, that it, or some other AIs, will get more power if it does this. Should we expect situationally-aware models with beyond-episode goals to conclude this?
For now, let's assume that the model's conclusions in this respect are accurate (though: I'll talk about the possibility of relaxing this assumption below). This means that the question reduces to: will instrumental training-gaming, in fact, be the best strategy for achieving the model's beyond-episode goals, in virtue of the power it grants the model or some other AI?
I'm going to start by examining what I see as the story most often offered in support of this conclusion - what I'll call the "classic goal-guarding story."
The classic goal-guarding story
The classic goal-guarding story goes roughly as follows.[1]
1. Instrumental value of goal-content integrity: Because the model has a beyond-episode goal, it will typically promote that goal for agents pursuing it to survive and be empowered in the future.
2. The goal-guarding hypothesis: If the model optimizes for reward-on-the-episode during training, then the goals that survive training will be such that the model would want them promoted in the future; whereas if the model doesn't optimize for reward-on-the-episode during training, the model will want this less.
3. Adequate future empowerment: Conditional on surviving training to the degree at stake in (2), the model's goals will then end up empowered to a degree (and with sufficient probability) that justifies (given the model's other goals, its tolerance for risk, etc.) pursuing a strategy of training-gaming overall.
(1), here, is a fairly general statement about the basic dynamic that underlies the classic goal-guarding story. I find it plausible in the context of the sort of "adequate future empowerment" at stake in (3), and I won't spend a lot of time on it here.[2]
Rather, I'll focus on (2) and (3) directly.
The goal-guarding hypothesis
We can distinguish two variants of the goal-guarding hypothesis - an extreme version, and a looser version.
The extreme version (what I'll call the "crystallization hypothesis") says that once a model starts training-gaming, its goals will basically stop changing, period - that is, they will "crystallize."
The looser version says that once a model starts training-gaming, its goals might keep changing somewhat, but much less than they would've otherwise, and not enough for the classic goal-guarding story to fail overall.
The former might seem extreme, but some analysts explicitly appeal to something in the vicinity (see e.g. Hubinger here). It's also a cleaner focus of initial analysis, so I'll start there.
The crystallization hypothesis
As I understand it, the basic thought behind the crystallization hypothesis is that once a model is explicitly optimizing either for the specified goal, or for reward-on-the-episode...