The Nonlinear Library: Alignment Forum

AF - "Clean" vs. "messy" goal-directedness (Section 2.2.3 of "Scheming AIs") by Joe Carlsmith



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: "Clean" vs. "messy" goal-directedness (Section 2.2.3 of "Scheming AIs"), published by Joe Carlsmith on November 29, 2023 on The AI Alignment Forum.
This is Section 2.2.3 of my report "Scheming AIs: Will AIs fake alignment during training in order to get power?". There's also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I'm hoping that it will provide much of the context necessary to understand individual sections of the report on their own.
Audio version of this section here, or search "Joe Carlsmith Audio" on your podcast app.
"Clean" vs. "messy" goal-directedness
We've now discussed two routes to the sort of beyond-episode goals that might motivate scheming. I want to pause here to note two different ways of thinking about the type of goal-directedness at stake - what I'll call "clean goal-directedness" and "messy goal-directedness." We ran into these differences in the last section, and they'll be relevant in what follows as well.
I said in section 0.1 that I was going to assume that all the models we're talking about are goal-directed in some sense. Indeed, I think most discourse about AI alignment rests on this assumption in one way or another.
But especially in the age of neural networks, the AI alignment discourse has also had to admit a certain kind of agnosticism about the cognitive mechanisms that will make this sort of talk appropriate. In particular: at a conceptual level, this sort of talk calls to mind a certain kind of clean distinction between the AI's goals, on the one hand, and its instrumental reasoning (and its capabilities/"optimization power" more generally), on the other.
That is, roughly, we decompose the AI's cognition into a "goal slot" and what we might call a "goal-pursuing engine" - e.g., a world model, a capacity for instrumental reasoning, other sorts of capabilities, etc. And in talking about models with different sorts of goals - e.g., schemers, training saints, mis-generalized non-training-gamers, etc. - we generally assume that the "goal-pursuing engine" is held roughly constant.
That is, we're mostly debating what the AI's "optimization power" will be applied to, not the sort of optimization power at stake. And when one imagines SGD changing an AI's goals, in this context, one mostly imagines it altering the content of the goal slot, thereby smoothly redirecting the "goal-pursuing engine" towards a different objective, without needing to make any changes to the engine itself.
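As a purely illustrative sketch (not anything from the report, with all names and structures hypothetical), this "goal slot plus goal-pursuing engine" picture looks roughly like the following: the goal is an explicit, swappable component, and retargeting the agent means editing only that one slot.

```python
# Toy sketch of the "goal slot" / "goal-pursuing engine" picture.
# Purely illustrative: all names here are hypothetical, not anything
# specified in the report.

from typing import Callable, List

State = str
Action = str

class CleanAgent:
    """Cognition factored into a goal slot and a goal-pursuing engine."""

    def __init__(self, goal: Callable[[State], float]):
        self.goal = goal  # the "goal slot": scores how desirable a state is

    def predict(self, state: State, action: Action) -> State:
        # Stand-in world model: predicts the state an action would lead to.
        return f"{state}, then {action}"

    def act(self, state: State, options: List[Action]) -> Action:
        # The "goal-pursuing engine": instrumental reasoning, here
        # collapsed into one-step lookahead over predicted outcomes.
        return max(options, key=lambda a: self.goal(self.predict(state, a)))

# Retargeting the agent means editing only the goal slot; the engine is
# untouched - the analogue of SGD smoothly redirecting optimization.
agent = CleanAgent(goal=lambda s: float("gold coins" in s))
agent.goal = lambda s: float("human approval" in s)
print(agent.act("lab", ["seek gold coins", "seek human approval"]))
```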
But it's a very open question how much this sort of distinction between an AI's goals and its goal-pursuing-engine will actually be reflected in the mechanistic structure of the AI's cognition - the structure that SGD, in modifying the model, has to intervene on. One can imagine models whose cognition is in some sense cleanly factorable into a goal, on the one hand, and a goal-pursuing-engine, on the other (I'll call this "clean" goal-directedness).
But one can also imagine models whose goal-directedness is much messier - for example, models whose goal-directedness emerges from a tangled kludge of locally-activated heuristics, impulses, desires, and so on, in a manner that makes it much harder to draw lines between e.g. terminal goals, instrumental sub-goals, capabilities, and beliefs (I'll call this "messy" goal-directedness).
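By contrast, here is an equally hypothetical sketch of the "messy" picture: behavior falls out of a pile of locally-triggered heuristics, there is no single component one could point to as "the goal" or "the engine," and retargeting the agent would require many coordinated edits rather than one.

```python
# Toy sketch of "messy" goal-directedness, again purely hypothetical.
# Behavior emerges from locally-activated heuristics; any overall
# "goal" is only implicit in how the kludge behaves in aggregate.

from typing import Callable, List, Tuple

State = str
Action = str
Heuristic = Tuple[Callable[[State], bool], Action]  # (trigger, response)

class MessyAgent:
    def __init__(self, heuristics: List[Heuristic]):
        self.heuristics = heuristics

    def act(self, state: State, default: Action = "wait") -> Action:
        # The first heuristic whose trigger fires wins. There is no
        # explicit goal slot to edit; changing what the agent "wants"
        # means rewriting many heuristics in a coordinated way.
        for trigger, response in self.heuristics:
            if trigger(state):
                return response
        return default

agent = MessyAgent([
    (lambda s: "food" in s, "grab it"),
    (lambda s: "overseer watching" in s, "look compliant"),
    (lambda s: "novel situation" in s, "fall back on imitation"),
])
print(agent.act("novel situation"))
```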
To be clear: I don't, myself, feel fully clear on the distinction here, and there is a risk of mixing up levels of abstraction (for example, in some sense, all computation - even the most cleanly goal-directed kind - is made up of smaller and more local computations that won't, themselves, seem goal-directed).
As another intuition pump, though: discussions of goal-directedness sometimes draw a distinction between so-called "sphex-ish" systems (that is, systems whose a...