Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This post lays out a framework I’m currently using for thinking about when AI systems will seek power in problematic ways. I think this framework adds useful structure to the too-often-left-amorphous “instrumental convergence thesis,” and that it helps us recast the classic argument for existential risk from misaligned AI in a revealing way. In particular, this recasting highlights how heavily classic analyses of AI risk rely on the assumption that the AIs in question are powerful enough to take over the world very easily, via a wide variety of paths. If we relax this assumption, I suggest, the strategic trade-offs that an AI faces in choosing whether or not to engage in some form of problematic power-seeking become substantially more complex.
Prerequisites for rational takeover-seeking
For simplicity, I’ll focus here on the most extreme [...]
---
Outline:
(00:53) Prerequisites for rational takeover-seeking
(02:51) Agential prerequisites
(06:43) Goal-content prerequisites
(09:13) Takeover-favoring incentives
(13:38) Recasting the classic argument for AI risk using this framework
(26:13) What if the AI can’t take over so easily, or via so many different paths?
The original text contained 16 footnotes which were omitted from this narration.
The original text contained 1 image which was described by AI.
---