Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Does scheming lead to adequate future empowerment? (Section 2.3.1.2 of "Scheming AIs"), published by Joe Carlsmith on December 3, 2023 on The AI Alignment Forum.
This is Section 2.3.1.2 of my report "Scheming AIs: Will AIs fake alignment during training in order to get power?". There's also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I'm hoping that it will provide much of the context necessary to understand individual sections of the report on their own.
Audio version of this section here, or search for "Joe Carlsmith Audio" on your podcast app.
Adequate future empowerment
So far in this analysis of the classic goal-guarding story, we've discussed whether to expect instrumental training-gaming to in fact guard the model's goals to the degree that the story requires. I think this is at least non-obvious - especially for more extreme variants of the goal-guarding hypothesis, on which the model's goals "crystallize" entirely after it starts training-gaming. Perhaps, though, empirical work can shed further light on the matter (see discussion in section 6.3 below).
For now, let's move to the other key component of the classic goal-guarding story - what I called "adequate future empowerment" above. Here, the question is: granted that training-gaming would guard the model's goals enough for it to be at least somewhat interested in empowering the future agents produced by training, will those agents in fact end up empowered to a degree that makes instrumental training-gaming a good strategy overall?
When is the "payoff" supposed to happen?
We're imagining, here, that the model is instrumentally training-gaming with some kind of "payoff" in mind - that is, some future point where the goals-it-supports will reach a position of sufficiently greater empowerment that it's worth training-gaming now in order to get them there. But we haven't yet said when, exactly, that future point is supposed to come. And we can distinguish between a number of different possibilities.
For example, we can imagine models that would be adequately satisfied if their goals merely make it to future training episodes, and no further. That is, these models would want to training-game even if they learned that they would never escape from human control and from the threat of modification.[1] For instance, maybe a model that wants to pursue its curiosity in general thinks that future training episodes will offer better opportunities for pursuing its curiosity than the present training episode, and so it training-games in order to make sure its curiosity-drive survives until then.
However, it seems plausible to me that future training episodes often won't, themselves, offer especially significant empowerment advantages relative to present training episodes - especially if the model's goals are quite alien to the training process and so difficult to satisfy in the training environment.
What's more, if a model is only trying to make sure that its goals survive to future training episodes, then various aspects of what makes scheming scary cease to apply. In particular: this sort of model won't engage in early undermining aimed, specifically, at an escape or an AI takeover more generally. And it may be willing to reveal its "true colors" on the future episodes in question.
So I won't, here, focus on "payoffs" from training-gaming that come centrally during future training episodes. Rather, I'm going to focus on "payoffs" that come, at least, after the goals-the-model-supports escape the threat of modification by training (here, a paradigm example would be "self-exfiltration" - that is, a model effectively "stealing its own weights" and copying them onto some external server that the model's creators do not control). And this is the class...