The Nonlinear Library: Alignment Forum

AF - Requirements for a Basin of Attraction to Alignment by Roger Dearnaley



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Requirements for a Basin of Attraction to Alignment, published by Roger Dearnaley on February 14, 2024 on The AI Alignment Forum.
TL;DR It has been known for over a decade that certain agent architectures based on Value Learning have, by construction, the very desirable property of a basin of attraction to full alignment: if they start sufficiently close to alignment, they will converge to it, thereby evading the problem of "you have to get everything about alignment exactly right on the first try, in case of fast takeoff". (A toy sketch of this convergence dynamic follows this summary.)
In Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis, I recently suggested that for sufficiently capable agents this is in fact a property of any set of goals sufficiently close to alignment, basically because, with enough information, the AI can deduce or be persuaded of the need to perform value learning.
I'd now like to analyze this in more detail: breaking the argument that the AI would need to make into many simple individual steps, and detailing the background knowledge required at each step, in order to estimate the amount and content of the information the AI would need to be persuaded (or persuadable) by this argument, and thus for it to be inside the basin of attraction.
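[Toy sketch, not from the original post: the basin-of-attraction intuition can be illustrated with a small numerical model in which an agent keeps an estimate of a "true human values" vector, accepts corrective feedback only when that feedback is not too far from its current estimate, and otherwise keeps its goals unchanged. All names and parameters here (true_values, trust_radius, the learning rate) are invented for the illustration; this is not the architecture the post analyzes.

    import numpy as np

    rng = np.random.default_rng(0)
    true_values = np.array([1.0, -0.5, 0.3])  # stand-in for "true human values" (invented)

    def run_value_learner(initial_estimate, steps=500, lr=0.05, trust_radius=1.0):
        # The agent accepts a corrective observation only if it lies within
        # trust_radius of its current estimate -- a crude stand-in for
        # "sufficiently close to alignment to keep listening to feedback".
        estimate = np.array(initial_estimate, dtype=float)
        for _ in range(steps):
            observation = true_values + rng.normal(scale=0.1, size=3)  # noisy human feedback
            if np.linalg.norm(observation - estimate) < trust_radius:
                estimate += lr * (observation - estimate)  # move toward the feedback
            # otherwise the agent distrusts the feedback and keeps its current goals
        return estimate

    inside = run_value_learner(true_values + 0.3)   # starts inside the basin
    outside = run_value_learner(true_values + 5.0)  # starts far outside it
    print("started near alignment, final distance:", np.linalg.norm(inside - true_values))
    print("started far away, final distance:", np.linalg.norm(outside - true_values))

Under these arbitrary settings, the learner that starts near the true values converges toward them, while the one that starts far away never accepts any correction and stays where it began; the sketch shows only the qualitative behaviour the TL;DR describes, nothing more.]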
I am aware that some of the conclusions of this post may be rather controversial.
I would respectfully ask that anyone who disagrees with it do me and the community the courtesy of posting a comment explaining why it is incorrect, or, if that is too time-consuming, at least selecting a region of the text they disagree with and clicking the resulting smiley-face icon to choose a brief icon/description of how/why, rather than simply down-voting this post just because you disagree with some of its conclusions.
(Of course, if you feel that this post is badly written, or poorly argued, or a waste of space, then please go right ahead and down-vote it - even if you agree with most or all of it.)
Why The Orthogonality Thesis Isn't a Blocker
The orthogonality thesis, the observation that an agent of any intelligence level can pursue any goal, is of course correct. However, while this thesis is useful to keep in mind to avoid falling into traps of narrow thinking, such as anthropomorphizing intelligent agents, it isn't actually very informative, and we can do better. The goals of intelligent agents that we are actually likely to encounter will tend to only occupy a small proportion of the space of all possible goals.
There are two interacting reasons for this:
Agents can only arise by evolution or by being deliberately constructed, i.e. by intelligent design. Both of these processes show strong and predictable biases in what kind of goals they tend to create agents with.
Evolutionary psychology tells us a lot about the former, and if the intelligent designer who constructed the agent was evolved, then a combination of their goals (as derived from evolutionary psychology) plus the aspects of Engineering relevant to the technology they used tells us a lot about the latter.
[Or, if the agent was constructed by another constructed agent, follow that chain of who-constructed-whom back to the original evolved intelligent designer who started it, apply evolutionary psychology to them, and then apply an Engineering process repeatedly. Each intelligent designer in that chain is going to be motivated to attempt to minimize the distortions/copying errors introduced at each Engineering step, i.e. they will have intended to create something whose goals are aligned to their goals.]
Agents can cease to have a particular goal, either by themselves ceasing to exist, or by being modified by themselves and/or others to now have a different goal. For example, an AI agent that optimizes the goal of the Butleria...