The Nonlinear Library: Alignment Forum

AF - Why focus on schemers in particular (Sections 1.3 and 1.4 of "Scheming AIs") by Joe Carlsmith



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Why focus on schemers in particular (Sections 1.3 and 1.4 of "Scheming AIs"), published by Joe Carlsmith on November 24, 2023 on The AI Alignment Forum.
This is Sections 1.3 and 1.4 of my report "Scheming AIs: Will AIs fake alignment during training in order to get power?". There's also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I'm hoping that it will provide much of the context necessary to understand individual sections of the report on their own.
Audio version of this section here, or search "Joe Carlsmith Audio" on your podcast app.
Why focus on schemers in particular?
As I noted above, I think schemers are the scariest model class in this taxonomy.[1] Why think that? After all, can't all of these models be dangerously misaligned and power-seeking? Reward-on-the-episode seekers, for example, will plausibly try to seize control of the reward process, if it will lead to more reward-on-the-episode.
This section explains why. However, if you're happy enough with the focus on schemers, feel free to skip ahead to section 1.4.
The type of misalignment I'm most worried about
To explain why I think that schemers are uniquely scary, I want to first say a few words about the type of misalignment I'm most worried about.
First: I'm focused, here, on what I've elsewhere called "practical power-seeking-alignment" - that is, on whether our AIs will engage in problematic forms of power-seeking on any of the inputs they will in fact receive. This means, importantly, that we don't need to instill goals in our AIs that lead to good results even when subject to arbitrary amounts of optimization power (e.g., we don't need to pass Yudkowsky's "omni test"). Rather, we only need to instill goals in our AIs that lead to good results given the actual options and constraints those AIs will face, and the actual levels of optimization power they will be mobilizing.
This is an importantly lower bar. Indeed, it's a bar that, in principle, all of these models (even schemers) can meet, assuming we control their capabilities, options, and incentives in the right way. For example, while it's true that a reward-on-the-episode seeker will try to seize control of the reward process given the opportunity, one tool in our toolbox is: to not give it the opportunity. And while a paradigm schemer might be lying in wait, hoping one day to escape and seize power (but performing well in the meantime), one tool in our toolbox is: to not let it escape (while continuing to benefit from its good performance).
Of course, success in this respect requires that our monitoring, control, and security efforts be sufficiently powerful relative to the AIs we're worried about, and that they remain so even as frontier AI capabilities scale up. But this brings me to my second point: namely, I'm here especially interested in the practical PS-alignment of some comparatively early set of roughly human-level - or at least, not-wildly-superhuman - models.
Defending this point of focus is beyond my purpose here. But it's important to the lens I'll be using in what follows.
In particular: I think it's plausible that there will be some key (and perhaps: scarily brief) stage of AI development in which our AIs are not yet powerful enough to take over (or to escape from human control), but where they are still capable, in principle, of performing extremely valuable and alignment-relevant cognitive work for us, if we can successfully induce them to do so. And I'm especially interested in forms of misalignment that might undermine this possibility.
Finally: I'm especially interested in forms of PS-misalignment in which the relevant power-seeking AIs are specifically aiming either to cause, participate in, or benefit from some kind of full-blown disempowerment of ...
The Nonlinear Library: Alignment Forum, by The Nonlinear Fund

