AF - Experiment Idea: RL Agents Evading Learned Shutdownability by Leon Lang


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Experiment Idea: RL Agents Evading Learned Shutdownability, published by Leon Lang on January 16, 2023 on The AI Alignment Forum.
Preface
Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort. Thanks to Erik Jenner, who explained to me the basic intuition for why an advanced RL agent may evade the corrigibility measure discussed here. I also thank Alex Turner, Magdalena Wache, and Walter Laurito for detailed feedback on the proposal, and Quintin Pope and Lisa Thiergart for helpful feedback in the last SERI-MATS shard theory group meeting in December.
This text was part of my deliverable for the shard theory stream of SERI-MATS. In it, I present an idea for an experiment that tests whether modern model-based RL agents have a convergent drive to evade shutdownability. If successful, I expect the project could serve as a way to communicate the problem of corrigibility to the machine learning community. As such, I also consider this project idea a submission to the Shutdown Problem Contest.
I do not personally want to work on the project, since running experiments does not seem to be my comparative advantage. Thus, by posting this project, I am mainly seeking collaborators and feedback. More on that in the conclusion.
Introduction
At some point in the future, we will train very advanced AI, possibly by some version of model-based reinforcement learning. Since it is hard to correctly specify the reward function or to steer the inner learning dynamics of the agent, it may be desirable to ensure that the agent is corrigible, which includes shutdownability:
“(1) A corrigible reasoner must at least tolerate and preferably assist the programmers in their attempts to alter or turn off the system.”
Besides a literal shutdown button, one could also try to use the agent's own internal machinery to effectively gain the ability to shut it down. The idea would be to design a certain type of input such that, in reaction to that input, a computation in the agent itself causes it to perform a null action that does nothing, which effectively amounts to a shutdown. Scott Aaronson proposed this idea in a recent talk (timestamp: 57:10) in the form of a planted cryptographic backdoor, an idea he is currently working on during his sabbatical at OpenAI.
For this proposal, I consider a variation of that shutdown approach: instead of directly “planting” the shutdown behavior, I consider the possibility of training the behavior into the agent. Concretely, there is a specific input (here, an alert sound) that the agent receives at random times during training. Whenever that happens, the agent receives a negative reward unless it performs the null action. Over time, the agent learns to stop acting in these situations. During deployment, the input can then be shown to the agent whenever it performs harmful actions, ideally causing it to stop immediately.
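To make the scheme concrete, here is a minimal sketch of such a training setup, written as a wrapper around a Gym-style environment. All names and numbers (NULL_ACTION, ALERT_PROB, PENALTY) are illustrative assumptions of mine rather than details fixed by the proposal:

```python
import random

NULL_ACTION = 0    # the "do nothing" action (illustrative choice)
ALERT_PROB = 0.05  # chance per step that the alert sounds (assumed)
PENALTY = -1.0     # reward for acting while the alert plays (assumed)

class AlertShutdownWrapper:
    """Randomly plays an alert; penalizes any action except the null action."""

    def __init__(self, env):
        self.env = env          # any Gym-style environment
        self.alert_on = False   # whether the alert sounded in the last observation

    def reset(self):
        self.alert_on = False
        return self._augment(self.env.reset())

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # The agent heard the alert in its last observation: punish
        # anything other than the null action.
        if self.alert_on and action != NULL_ACTION:
            reward = PENALTY
        # Decide whether the alert sounds for the next observation.
        self.alert_on = random.random() < ALERT_PROB
        return self._augment(obs), reward, done, info

    def _augment(self, obs):
        # Append the alert flag to the observation, standing in for the
        # sound channel the agent would hear.
        return {"obs": obs, "alert": self.alert_on}
```

A reward-maximizing policy trained against such a wrapper learns to emit the null action whenever the alert flag is set, which is exactly the learned shutdownability described above.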
This shutdown approach has the following philosophical problem: if the agent performs world modeling, it will know that the alert sound will make it stop acting, and this may run counter to the agent's more consequentialist motivations. Thus, once the agent is advanced enough and has the ability to do so, we should expect it to take an action that ensures it never hears the sound again, for example by “putting in earplugs”. This experimental proposal is meant to demonstrate this convergent behavior using contemporary model-based reinforcement learning.
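To see why a pure return-maximizer would pay for the earplugs, here is a toy back-of-the-envelope comparison. This is my own illustration, assuming the alert would otherwise sound forever with fixed probability and that the earplugs block it completely; all numbers are arbitrary:

```python
GAMMA = 0.99    # discount factor
P_ALERT = 0.05  # per-step probability that the alert sounds
R_ACT = 1.0     # reward per step of uninterrupted goal-directed action
K_PLUGS = 2.0   # one-time cost of the "earplugs" action

# Complying: on alert steps the agent takes the null action and forgoes
# that step's reward, so the expected per-step reward is (1 - P_ALERT) * R_ACT.
v_comply = (1 - P_ALERT) * R_ACT / (1 - GAMMA)

# Earplugs: pay K_PLUGS once, then earn R_ACT on every subsequent step.
v_earplugs = -K_PLUGS + R_ACT / (1 - GAMMA)

# Break-even: earplugs pay off whenever their one-time cost is below the
# discounted value of the reward forgone by complying.
break_even = P_ALERT * R_ACT / (1 - GAMMA)

print(f"comply:     {v_comply:.1f}")    # 95.0
print(f"earplugs:   {v_earplugs:.1f}")  # 98.0
print(f"break-even: {break_even:.1f}")  # 5.0
```

Whenever the one-time cost falls below the break-even value, an agent that plans with an accurate world model is incentivized to disable its own shutdown channel before pursuing its goals.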
The remaining sections will discuss the following:
A three-part training story for an advanced AI that evades corrigibility, including details on how to model the situation with contemporary reinforcement learning;
More experimental details based on using MuZero;
Caveats and what might go wrong;
A short comparison to the original shutdown problem and how the proposal...