The Nonlinear Library: Alignment Forum

AF - Situational awareness (Section 2.1 of "Scheming AIs") by Joe Carlsmith


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Situational awareness (Section 2.1 of "Scheming AIs"), published by Joe Carlsmith on November 26, 2023 on The AI Alignment Forum.
This is Section 2.1 of my report "Scheming AIs: Will AIs fake alignment during training in order to get power?". There's also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I'm hoping that it will provide much of the context necessary to understand individual sections of the report on their own.
Audio version of this section [here](https://joecarlsmithaudio.buzzsprout.com/2034731/13984823-situational-awareness-section-2-1-of-scheming-ais), or search "Joe Carlsmith Audio" on your podcast app.
What's required for scheming?
Let's turn, now, to examining the probability that baseline ML methods for training advanced AIs will produce schemers. I'll begin with an examination of the prerequisites for scheming. I'll focus on:
Situational awareness: that is, the model understands that it's a model in a training process, what the training process will reward, and the basic nature of the objective world in general.[1]
Beyond-episode goals: that is, the model cares about the consequences of its actions after the episode is complete.[2]
Aiming at reward-on-the-episode as part of a power-motivated instrumental strategy: that is, the model believes that its beyond-episode goals will be better achieved if it optimizes for reward-on-the-episode - and in particular, that it, or some other AIs, will get more power if it does this.[3]
Situational awareness
Will models have situational awareness? Let's distinguish between two broad sorts of information at stake in such awareness:
General information about the objective world, including e.g. information about how machine learning training works.
"Self-locating" information: that is, information that locates the model in the objective world, and tells it facts about its own situation in particular - e.g., that it is this sort of model, that it's being trained on this particular reward signal, at this particular lab, during this particular time period, etc.[4] (Though: note that it's not clear how much of this sort of information is necessary to start scheming.
It seems very plausible that even somewhat-better-than-human models will absorb huge amounts of general information about the objective world, and develop detailed, mechanistic models of how it works. Indeed, current models already have access to vast quantities of information via the pre-training data - including information about machine learning in particular. And their ability to model the world mechanistically, to make inferences, to draw conclusions they haven't "memorized," and so on, seems to be improving rapidly.
What's more, while one can in principle try to specifically prevent models from gaining certain types of information about the objective world (e.g., by excluding certain kinds of information from the training data), this isn't the current default in training, and various kinds of information can be fairly important to the task you want the model to perform. And the more sophisticated the models are, the more difficult it is to ensure that they can't infer the information you're trying to hide on the basis of the information you do give them.
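As a rough illustration (not from the report), a naive version of this kind of training-data exclusion might be a keyword filter over pre-training documents. The blocklist and corpus below are hypothetical; real data-filtering pipelines are far more involved:

```python
# Minimal sketch of keyword-based pre-training data filtering.
# The blocked terms and document format here are hypothetical
# illustrations, not an actual lab's filtering setup.

BLOCKED_TERMS = [
    "reward model",
    "RLHF",
    "gradient descent",
    "training run",
]

def is_allowed(document: str) -> bool:
    """Return False if the document mentions any blocked term."""
    lowered = document.lower()
    return not any(term.lower() in lowered for term in BLOCKED_TERMS)

def filter_corpus(documents: list[str]) -> list[str]:
    """Keep only documents that pass the keyword filter."""
    return [doc for doc in documents if is_allowed(doc)]

corpus = [
    "A recipe for sourdough bread.",
    "An explainer on RLHF and reward models.",
]
print(filter_corpus(corpus))  # -> ['A recipe for sourdough bread.']
```

Even a much more careful version of this faces the problem the passage raises: a sufficiently capable model may reconstruct the excluded facts from the material that remains.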
Do the same sorts of considerations apply to self-locating information? I tend to think: yes. But it's at least somewhat less clear.
For example, while language model pre-training data will, by default, include a lot of information about language models and how they are trained (because such information is widely available on the internet), it's less clear how much information it will give the model about its situation in particular - or even whether the pre-training next-token-prediction task will incentivize the model to have much...