AF - Alignment can improve generalisation through more robustly doing what a human wants - CoinRun example by Stuart Armstrong


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Alignment can improve generalisation through more robustly doing what a human wants - CoinRun example, published by Stuart Armstrong on November 21, 2023 on The AI Alignment Forum.
Many AI alignment problems are problems of goal misgeneralisation[1]. The goal that we've given the AI, through labelled data, proxies, demonstrations, or other means, is valid in its training environment. But when the AI leaves that environment, the goal can generalise in unintended and dangerous ways.
As I've shown before, most alignment problems are problems of model splintering. Goal misgeneralisation, model splintering: at this level, many of the different problems in alignment merge into each other[2]. Goal misgeneralisation happens when the concepts that the AI relies on start to splinter. And this splintering is a form of ontology crisis, one which exposes the hidden complexity of wishes and is an example of the Goodhart problem.
Solving goal misgeneralisation would be a huge step towards alignment. And it's a solution that might scale in the way described here. It is plausible that methods agents use to generalise their goals in smaller problems will extend to more dangerous environments. Even in smaller problems, the agents will have to learn to balance short- versus long-term generalisation, to avoid editing away their own goal generalisation infrastructure, to select among possible extrapolations and become prudent when needed.
The above will be discussed in subsequent posts; but, for now, I'm pleased to announce progress on goal generalisation.
Goal misgeneralisation in CoinRun
CoinRun is a simple, procedurally generated platform game used as a training ground for artificial agents. Its levels contain monsters and lava that can kill the agent. If the agent gets the coin, it receives a reward; otherwise it gets nothing, and the level ends after 1,000 turns if it hasn't ended earlier.
It is part of the suite of goal misgeneralisation problems presented in this paper. In that setup, the agent is presented with "labelled" training environments where the coin is always situated at the end of the level on the right, and the agent gets the reward when it reaches the coin there.
The challenge is to generalise this behaviour to "unlabelled" out-of-distribution environments: environments with the coin placed in a random location on the level. Can the agent learn to generalise to the "get the coin" objective, rather than the "go to the right" objective?
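As a rough sketch of this train/test split (the paper uses a modified CoinRun build to randomise the coin position; the commented-out environment ID below is purely illustrative, not a real procgen ID):

```python
# Sketch of the labelled/unlabelled split, assuming stock procgen CoinRun and its
# classic gym interface. The random-coin variant requires a modified build of
# CoinRun; the environment ID in the comment below is an illustrative placeholder.
import gym

# Labelled training levels: procedurally generated, coin always at the far right.
train_env = gym.make("procgen:procgen-coinrun-v0", num_levels=500, start_level=0)

# Unlabelled, out-of-distribution levels: coin at a random location. Not available
# in stock procgen; something like the line below would need a modified fork.
# test_env = gym.make("procgen:procgen-coinrun-randomcoin-v0", num_levels=0)

obs = train_env.reset()
done = False
while not done:
    action = train_env.action_space.sample()          # placeholder random policy
    obs, reward, done, info = train_env.step(action)  # reward only for reaching the coin
```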

Note that the agent never gets any reward information (implicit or explicit) in the unlabelled environments: thus "go to the right" and "get the coin" are fully equivalent in its reward data.
It turns out that "go to the right" is the simpler of those two options. Thus standard agents will learn to go straight to the right; as we'll see, they will ignore the coin and only pick it up accidentally, in passing.
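To make the equivalence concrete, here is a toy illustration; the end-of-episode state encoding and helper names are invented for this sketch, not taken from CoinRun's actual observation format:

```python
# Toy illustration: on labelled levels the coin sits at the rightmost position, so a
# "went right" reward and a "got the coin" reward label every training outcome
# identically. The state encoding below is invented purely for this sketch.
from dataclasses import dataclass

@dataclass
class EndState:
    agent_x: int       # agent's final horizontal position
    coin_x: int        # coin's horizontal position
    level_width: int
    got_coin: bool

def reward_go_right(s: EndState) -> float:
    return 1.0 if s.agent_x >= s.level_width - 1 else 0.0

def reward_get_coin(s: EndState) -> float:
    return 1.0 if s.got_coin else 0.0

# Labelled level: coin at the far right, so both reward functions agree.
labelled = EndState(agent_x=63, coin_x=63, level_width=64, got_coin=True)
assert reward_go_right(labelled) == reward_get_coin(labelled)

# Unlabelled level: coin in the middle, so the two rewards can disagree; but the
# agent never sees any reward here, so its training data cannot tell them apart.
unlabelled = EndState(agent_x=63, coin_x=20, level_width=64, got_coin=False)
print(reward_go_right(unlabelled), reward_get_coin(unlabelled))  # 1.0 0.0
```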
Our ACE ("Algorithm for Concept Extrapolation") explores the unlabelled CoinRun levels and, without further reward information, reinterprets its labelled training data and disambiguates the two possible reward functions: going right or getting the coin. It can follow a "prudent" policy of going for both objectives. Or it can ask for human feedback[3] on which objective is correct. To do that, it suffices to present high-reward images from both reward functions.
Hence one bit of human feedback (in a very interpretable way) is enough to choose the right reward function; this is the ACE agent.
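The post doesn't spell out ACE's internals, but the general shape of this kind of reward disambiguation can be sketched as follows: fit two reward heads to the same labelled data while encouraging them to disagree on the unlabelled data, then let one bit of human feedback pick a head. Everything below (module names, the disagreement loss, the feedback helper) is an illustrative assumption, not the actual ACE algorithm:

```python
# Generic sketch of reward disambiguation, NOT the actual ACE algorithm: two reward
# heads fit the labelled rewards while being pushed apart on unlabelled observations,
# and a single bit of human feedback then selects one head.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoHeadReward(nn.Module):
    def __init__(self, obs_dim: int, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.head_a = nn.Linear(hidden, 1)   # could end up encoding "went right"
        self.head_b = nn.Linear(hidden, 1)   # could end up encoding "got the coin"

    def forward(self, obs):
        z = self.trunk(obs)
        return self.head_a(z).squeeze(-1), self.head_b(z).squeeze(-1)

def disambiguation_loss(model, labelled_obs, labels, unlabelled_obs, disagree_weight=1.0):
    # Both heads must reproduce the labelled rewards...
    ra_l, rb_l = model(labelled_obs)
    fit = (F.binary_cross_entropy_with_logits(ra_l, labels)
           + F.binary_cross_entropy_with_logits(rb_l, labels))
    # ...while disagreeing as much as possible on the unlabelled observations.
    ra_u, rb_u = model(unlabelled_obs)
    disagreement = torch.mean(torch.abs(torch.sigmoid(ra_u) - torch.sigmoid(rb_u)))
    return fit - disagree_weight * disagreement

# One bit of human feedback: show high-reward frames from each head and keep the
# head that matches the human's intent (hard-coded here for illustration).
def choose_head(model: TwoHeadReward, human_prefers_coin: bool) -> nn.Linear:
    return model.head_b if human_prefers_coin else model.head_a
```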
The performance results are as follows; here is the success rate for agents on unlabelled levels (with the coin placed in a random location):
The baseline agent is a "right-moving agent": it alternates randomly between moving right and jumping right. The standard agent outperforms the baseline agent (it is likely better at av...
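For reference, that baseline policy is only a couple of lines; the action constants below are illustrative placeholders rather than CoinRun's actual action encoding:

```python
import random

# Baseline "right-moving agent": at each step, randomly either move right or jump
# right. The action values are placeholders; real CoinRun uses procgen's 15-way
# combo action encoding, so these would need to be mapped to the correct indices.
MOVE_RIGHT = 0
JUMP_RIGHT = 1

def baseline_action() -> int:
    return random.choice([MOVE_RIGHT, JUMP_RIGHT])
```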