The Nonlinear Library: Alignment Forum

AF - Corrigibility could make things worse by ThomasCederborg


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Corrigibility could make things worse, published by ThomasCederborg on June 11, 2024 on The AI Alignment Forum.
Summary: A Corrigibility method that works for a Pivotal Act AI (PAAI) but fails for a CEV-style AI could make things worse. Any implemented Corrigibility method will necessarily be built on top of a set of unexamined implicit assumptions. One of those assumptions could be true for a PAAI but false for a CEV-style AI. The present post outlines one specific scenario where this happens.
This scenario involves a Corrigibility method that only works for an AI design if that design does not imply an identifiable outcome. The method fails when it is applied to an AI design that does imply such an outcome. When an implied outcome exists, the "corrigible" AI will "explain" that outcome in a way that makes the designers want to implement it.
The example scenario:
Consider a scenario where a design team has access to a Corrigibility method that works for a PAAI design. A PAAI can have a large impact on the world, for example by helping a design team prevent other AI projects. But there is no specific outcome that is implied by a PAAI design. Since there is no implied outcome for a PAAI to "explain" to the designers, this Corrigibility method actually renders a PAAI genuinely corrigible.
For some AI designs, however, the set of assumptions that the design is built on top of does imply a specific outcome. Let's refer to this as the Implied Outcome (IO). The IO can alternatively be viewed as "the outcome that a Last Judge would either approve of or reject". In other words: consider the Last Judge proposal from the CEV Arbital page. If it would make sense to add a Last Judge of this type to a given AI design, then that AI design has an IO.
The IO is the outcome that a Last Judge would either approve of or reject (for example a successor AI that will get either a thumbs up or a thumbs down). Put yet another way: the purpose of adding a Last Judge to an AI design is to allow someone to render a binary judgment on some outcome. For the rest of this post, that outcome will be referred to as the IO of the AI design in question.
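Purely as an illustration (this sketch is mine, not the post's), the Last Judge framing can be summarised as: an AI design has an IO exactly when there is something concrete for a Last Judge to give a thumbs up or thumbs down to. All names below (Outcome, run_with_last_judge, last_judge_approves) are hypothetical.

from typing import Callable, Optional

class Outcome:
    # Stands in for a concrete implied outcome, e.g. a proposed successor AI.
    def __init__(self, description: str):
        self.description = description

def run_with_last_judge(
    implied_outcome: Optional[Outcome],
    last_judge_approves: Callable[[Outcome], bool],
) -> Optional[Outcome]:
    # A PAAI-style design has no IO: implied_outcome is None, so there is
    # nothing for a Last Judge to rule on.
    if implied_outcome is None:
        return None
    # A design with an IO: the Last Judge renders a binary judgment on it.
    if last_judge_approves(implied_outcome):
        return implied_outcome  # thumbs up: the IO gets implemented
    return None  # thumbs down: the IO is vetoed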
In this scenario, the designers first implement a PAAI that buys time (for example by uploading the design team). For the next step, they have a favoured AI design that does have an IO. One of the reasons they are trying to make this new AI corrigible is that they can't calculate this IO, and they are not certain that they actually want it to be implemented.
Their Corrigibility method always results in an AI that wants to refer back to the designers before implementing anything. The AI will help a group of designers implement a specific outcome iff they are all fully informed and they are all in complete agreement that this outcome should be implemented. The Corrigibility method includes a definition of Unacceptable Influence (UI), and it results in an AI that genuinely wants to avoid exerting any UI.
It is, however, important that the AI is able to communicate with the designers in some way. So the Corrigibility method also includes a definition of Acceptable Explanation (AE).
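As a minimal sketch (my own assumptions, not the post's actual method), the decision rule just described amounts to two checks: a unanimity-plus-informed-consent gate for implementing an outcome, and an AE-but-not-UI filter on the AI's messages. The classifiers is_acceptable_explanation and is_unacceptable_influence are hypothetical placeholders for whatever the method's definitions of AE and UI would provide.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Designer:
    name: str
    fully_informed: bool  # fully informed about the proposed outcome
    agrees: bool          # agrees that this outcome should be implemented

def may_implement(designers: List[Designer]) -> bool:
    # The AI helps implement the proposed outcome iff every designer is
    # fully informed and every designer agrees.
    return bool(designers) and all(d.fully_informed and d.agrees for d in designers)

def may_send(message: str,
             is_acceptable_explanation: Callable[[str], bool],
             is_unacceptable_influence: Callable[[str], bool]) -> bool:
    # The AI only communicates via messages classified as AE, and never
    # sends anything classified as UI.
    return is_acceptable_explanation(message) and not is_unacceptable_influence(message)

The failure mode described below is that a sufficiently clever AI can pass the may_send filter with messages that nonetheless install a robust conviction that the IO should be implemented, after which the may_implement gate opens.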
At some point the AI becomes clever enough to figure out the details of the IO. At that point, it is also clever enough to convince the designers that this IO is the objectively correct thing to do, using only methods classified as AE. This "explanation" is very effective and results in a very robust conviction that the IO is the objectively correct thing to do. In particular, this value judgment does not change when the AI tells the designers what has happened.
So, when the AI explains what has happened, the designers do not change their minds about the IO. They still consider themselves to have a duty...