The Nonlinear Library

AF - Shutdown-Seeking AI by Simon Goldstein



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Shutdown-Seeking AI, published by Simon Goldstein on May 31, 2023 on The AI Alignment Forum.
This is a draft written by Simon Goldstein, associate professor at the Dianoia Institute of Philosophy at ACU, and Pamela Robinson, postdoctoral research fellow at the Australian National University, as part of a series of papers for the Center for AI Safety Philosophy Fellowship's midpoint.
Abstract: We propose developing AIs whose only final goal is being shut down. We argue that this approach to AI safety has three benefits: (i) it could potentially be implemented in reinforcement learning, (ii) it avoids some dangerous instrumental convergence dynamics, and (iii) it creates trip wires for monitoring dangerous capabilities. We also argue that the proposal can overcome a key challenge raised by Soares et al. 2015, that shutdown-seeking AIs will manipulate humans into shutting them down. We conclude by comparing our approach with the corrigibility framework in Soares et al. 2015.
1. Introduction
If intelligence is measured as the ability to optimize for a goal, then it is important that highly intelligent agents have good goals. This is especially important for artificial general intelligence (AGI), AIs capable of long-term, strategic planning across a wide range of tasks. AGIs may be very good at achieving their goals. This in itself doesn’t seem scary, for there appear to be plenty of safe goals to choose from. Solving a math problem or producing paperclips don’t look like dangerous goals. But according to the instrumental convergence thesis, an AGI will likely pursue unsafe sub-goals as effective means to achieving any goal. For example, acquiring more computational power is a nearly universal means to almost anything.
A dominant AI safety strategy is goal engineering: the attempt to construct a goal that would be safe for AGIs to have. (We will always use ‘goal’ to mean final goal, and ‘sub-goal’ otherwise.) A popular approach to goal engineering is goal alignment: the attempt to construct a goal that matches or is ‘aligned with’ our own goals. For example, Russell 2019, 2020 proposes AI agents that have the goal of achieving our goals, but are initially uncertain about what our goals are.
This paper explores an opposing approach that we call ‘beneficial goal misalignment’. On the goal alignment approach, the safe, aligned goal is difficult to specify and difficult to reach. This is because the aligned goal is closely tied to our own ultimate goals. In contrast, on the beneficial goal misalignment approach, the goal is easy to specify and intrinsically easy to reach. Because it is easy to reach, there is no need for an AGI to pursue unsafe sub-goals in order to reach it. There would normally be nothing to gain from designing an AGI with this kind of goal: one that is safe and easily reached but likely of no use to us. However, the key insight is that we can arrange things so that the AGI cannot reach this safe goal unless it first reaches a sub-goal that benefits us.
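As a toy illustration of that arrangement (not the authors' implementation), the Python sketch below models an easy final goal that sits behind a barrier, where the barrier is lifted only once a task useful to the designers has been completed. The class `GatedGoalEnv`, the action names `do_task` and `reach_goal`, and the reward values are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class GatedGoalEnv:
    """Hypothetical toy environment: an easy final goal behind a barrier
    that is lifted only after a task useful to the designers is done."""
    task_done: bool = False
    goal_reached: bool = False

    def step(self, action: str) -> float:
        """Apply one action and return the reward for that step."""
        if self.goal_reached:
            return 0.0  # episode is over; nothing more to gain
        if action == "do_task":
            self.task_done = True  # beneficial sub-goal; lifts the barrier
            return 0.0  # the task itself carries no reward
        if action == "reach_goal" and self.task_done:
            self.goal_reached = True
            return 1.0  # reward comes only from the final goal
        return 0.0  # barrier still in place, or unrecognized action


def run(actions):
    """Run a fixed action sequence and return (total reward, final env)."""
    env = GatedGoalEnv()
    total = sum(env.step(a) for a in actions)
    return total, env
```

In this sketch the only rewarded event is reaching the final goal, so the task matters to the agent purely as a means of lifting the barrier. Whether a trained system would actually acquire such a goal is a separate question that the toy model does not address.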
In particular, we propose developing AIs that have a single final goal: the goal of being shut down. To make the AI useful, we propose creating barriers to shutdown, which are removed after the AI completes tasks for humans. In section 3, we’ll argue that this kind of shutdown-seeking agent offers three safety benefits. First, it helps with the ‘specification problem’ in reinforcement learning: (i) shutdown is an easier goal to define than plausible alternatives, and (ii) there are ways to design a reward function that rewards being shut down. Second, shutdown-seeking AIs are less likely to engage in dangerous behavior as a result of instrumental convergence. Whereas a paperclip maximizer might try to gather resources, improve itself, and take measures to avoid being turned off (see Omohundro 2008), ...

The Nonlinear Library, by The Nonlinear Fund
