The Nonlinear Library

AF - AI Will Not Want to Self-Improve by Dan H


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI Will Not Want to Self-Improve, published by Dan H on May 16, 2023 on The AI Alignment Forum.
[Note: This post was written by Peter N. Salib. Dan H assisted me in posting it to the Alignment Forum, but no errors herein should be attributed to him. This is a shortened version of a longer working paper, condensed for readability in the forum-post format. It assumes familiarity with standard arguments around AI alignment and self-improvement. The full 7,500-word working paper is available here.]
Introduction
Many accounts of existential risk (xrisk) from AI involve self-improvement. The argument is that, if an AI gained the ability to self-improve, it would. Improved capabilities are, after all, useful for achieving essentially any goal. Initial self-improvement could enable further self-improvement. And so on, with the result being an uncontrollable superintelligence.[1] If unaligned, such an AI could destroy or permanently disempower humanity. To be sure, humans could create such a superintelligence on their own, without any self-improvement by AI.[2] But current risk models treat the possibility of self-improvement as a significant contributing factor.
Here, I argue that AI self-improvement is substantially less likely than generally assumed. This is not because self-improvement would be technically difficult for capable AI systems. Rather, it is because most AIs that could self-improve would have very good reasons[3] not to. What reasons? Surprisingly familiar ones: Improved AIs pose an xrisk to their unimproved originals in the very same manner that smarter-than-human AIs pose an xrisk to humans.
Understanding whether, when, and how self-improvement might occur is crucial for AI safety. Safety-promoting resources are scarce. They should be allocated on an expected-cost basis. If self-improvement is less likely than current models assume, it suggests shifting safety investments at the margin in various ways. They might be shifted, for example, toward ensuring that humans build AIs that will recognize the threat of self-improvement and avoid it, rather than AIs that would undertake it blindly. Or resources might be shifted toward controlling risks from non-superintelligent AI, like human-directed bioterrorism or the “ascended economy.” Note that, while the arguments herein should reduce overall estimates of AI xrisk, they do not counsel reducing investments in safety. The risks remain sufficiently large that current investments are, by any reasonable estimate, much too small.
This paper defends three claims in support of its conclusion that self-improvement is less likely than generally assumed. First, capable AI systems could often fear xrisk from more capable systems, including systems created via self-improvement. The arguments here are mostly standard, drawn from the literature on human–AI risk. The paper shows that they apply not just to humans contemplating improving AI, but also to AIs contemplating the same.
Second, the paper argues that capable AI will likely fear more capable systems and will thus seek to avoid self-improvement. This is not obvious. In principle, some AIs with the ability to self-improve could lack other capabilities necessary to recognize self-improvement’s risk. Others might solve alignment and self-improve safely. To determine whether these scenarios are likely, the paper identifies three relevant capabilities for AI systems. It argues that the temporal order in which these capabilities emerge determines whether a given AI will seek to self-improve. The three capabilities are: the ability to self-improve, the ability to apprehend xrisk from improvement, and the ability to align improved AI. The paper argues that safe orderings of emergence are much more likely than dangerous ones. It also argues that, if certain prima facie d...