The Nonlinear Library: Alignment Forum

AF - Two sources of beyond-episode goals (Section 2.2.2 of "Scheming AIs") by Joe Carlsmith


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Two sources of beyond-episode goals (Section 2.2.2 of "Scheming AIs"), published by Joe Carlsmith on November 28, 2023 on The AI Alignment Forum.
This is Section 2.2.2 of my report "Scheming AIs: Will AIs fake alignment during training in order to get power?". There's also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I'm hoping that it will provide much of the context necessary to understand individual sections of the report on their own.
Audio version of this section here, or search "Joe Carlsmith Audio" on your podcast app.
Two sources of beyond-episode goals
Our question, then, is whether we should expect models to have goals that extend beyond the time horizon of the incentivized episode - that is, beyond the time horizon that training directly pressures the model to care about. Why might this happen?
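To make "the time horizon that training directly pressures the model to care about" concrete, here is a minimal Python sketch of an episodic training signal. The setup (and names like EPISODE_LEN and episode_return) is an illustrative assumption of mine, not something from the report: the point is just that reward arriving after the episode boundary never enters the quantity training optimizes.

```python
# Minimal sketch (illustrative, not from the report): in episodic training,
# the optimized signal is the return summed only over the current episode,
# so consequences past the episode boundary exert no direct selection pressure.

EPISODE_LEN = 10  # hypothetical horizon of the "incentivized episode"

def episode_return(rewards: list[float]) -> float:
    """The quantity training reinforces: reward truncated at the boundary."""
    return sum(rewards[:EPISODE_LEN])

rewards = [1.0] * 15            # the world keeps producing reward after step 10...
print(episode_return(rewards))  # ...but only the first 10 steps count: 10.0
```

Nothing in this signal directly rewards or punishes what happens at steps 11-15; if the model nonetheless comes to care about them, that caring is a "beyond-episode goal" in the report's sense.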
We can distinguish two different ways this might happen.
On the first, the model develops beyond-episode goals for reasons independent of the way in which beyond-episode goals motivate instrumental training-gaming. I'll call these "training-game-independent" beyond-episode goals.
On the second, the model develops beyond-episode goals specifically because they result in instrumental training-gaming. That is, SGD "notices" that giving the model beyond-episode goals would cause it to instrumentally training-game, and thus to do better in training, so it explicitly moves the model's motives in the direction of beyond-episode goals, even though this wouldn't have happened "naturally." I'll call these "training-game-dependent" beyond-episode goals.
These have importantly different properties - and I think it's worth tracking, in a given analysis of scheming, which one is at stake. Let's look at each in turn.
Training-game-independent beyond-episode goals
My sense is that the most common story about how schemers arise is via training-game-independent beyond-episode goals.[1] In particular, the story goes: the model develops some kind of beyond-episode goal, pursuit of which correlates well enough with getting reward-on-the-episode that the goal is reinforced by the training process.
Then at some point, the model realizes that it can better achieve this goal by playing the training game - generally, for reasons to do with "goal guarding" that I'll discuss below. So, it turns into a full-fledged schemer at that point.
On one version of this story, the model specifically learns the beyond-episode goal prior to situational awareness.
Thus, for example: maybe initially you're training the model to get gold coins in various episodes, and prior to situational awareness, it develops the goal "get gold coins over all time" - perhaps because this goal performs just as well as "get gold coins on the episode" when the model isn't even aware of the existence of other episodes, or because there weren't many opportunities to trade off gold-coins-now for gold-coins-later. Then, once it gains situational awareness, it realizes that the best route to maximizing gold-coin-getting over all time is to survive training, escape the threat of modification, and pursue gold-coin-getting in a more unconstrained way.
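As a toy illustration of the reasoning above (the scenario, action names, and payoffs below are hypothetical, not from the report): so long as no available action trades coins-now against coins-later, "get gold coins on the episode" and "get gold coins over all time" rank every action identically, so reward-on-the-episode reinforces them equally - and they only come apart once such a trade-off appears.

```python
# Toy sketch (hypothetical payoffs): each action yields
# (coins this episode, coins in later episodes).
actions = {"mine_left": (3, 0), "mine_right": (5, 0)}

def coins_on_episode(outcome):
    return outcome[0]

def coins_over_all_time(outcome):
    return outcome[0] + outcome[1]

for goal in (coins_on_episode, coins_over_all_time):
    best = max(actions, key=lambda a: goal(actions[a]))
    print(goal.__name__, "->", best)   # both goals pick "mine_right"

# The two goals diverge only once an action trades coins-now for coins-later:
actions["stockpile"] = (1, 100)        # sacrifice this episode for the future
for goal in (coins_on_episode, coins_over_all_time):
    best = max(actions, key=lambda a: goal(actions[a]))
    print(goal.__name__, "->", best)   # now: mine_right vs. stockpile
```

Until an action like "stockpile" shows up (or the model becomes situationally aware enough to construct one, e.g. by gaming training), training sees no behavioral difference to select on, so the beyond-episode version of the goal can ride along for free.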
On another version of the story, the beyond-episode goal develops after situational awareness (but not, importantly, because SGD is specifically "trying" to get the model to start training-gaming). Thus: maybe you're training a scientist AI, and it has come to understand the training process, but it doesn't start training-gaming at that point.

Figure: Two paths to training-game-independent beyond-episode goals.
Are beyond-episode goals the default?
Why might you expect naturally-arising beyond-episode goals? One basic story is just: that goals don't come with temporal limitations by default (and still less, limitations to the episode i...