The Nonlinear Library: Alignment Forum

AF - Different views of alignment have different consequences for imperfect methods by Stuart Armstrong

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Different views of alignment have different consequences for imperfect methods, published by Stuart Armstrong on September 28, 2023 on The AI Alignment Forum.
Almost any powerful AI, with almost any goal, will doom humanity. Hence alignment is often seen as a constraint on AI power: we must direct the AI's optimisation power in a very narrow direction. If the AI is weak, then imperfect methods of alignment might be sufficient. But as the AI's power rises, the alignment methods must be better and better. Alignment is thus a dam that has to be tall enough and sturdy enough: as the waters of AI power pile up behind it, they will exploit any crack in the dam or simply overtop it.
So assume A is an alignment method that works in environment E (where "environment" includes the physical setup, the AI's world-model and knowledge, and the AI's capabilities in general). Then we expect that there is an environment E' where A fails - one where the AI's capabilities are no longer constrained by A.
Now, maybe there is a better alignment method A' that would work in E', but there is likely another environment E'' where A' fails. So unless A is "almost perfect", there will always be some environment in which it fails.
So we need A to be "almost perfect". Furthermore, in the conventional view, you can't get there by combining imperfect methods - "belt and braces" doesn't work. So if U is a utility function that partially defines human flourishing but is missing some key elements, and if B is a box that contains the AI so that it can only communicate with humans via text interfaces, then U+B is not much more of a constraint than U and B individually. Most AIs that are smart enough to exploit U and break B can get around U+B.
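The "belt and braces" point can be illustrated with a toy model. Everything here (the policy names, which filters they pass, which are actually safe) is invented for illustration, not taken from the post:

```python
# Toy model: each candidate AI policy either satisfies the proxy utility U,
# respects the box B, both, or neither - and is either actually safe or not.
policies = [
    # (name, satisfies_U, respects_B, actually_safe)
    ("honest helper", True,  True,  True),
    ("U-gamer",       False, True,  False),  # exploits the gaps in U
    ("box-escaper",   True,  False, False),  # breaks out of B
    ("persuader",     True,  True,  False),  # passes U and B, yet unsafe:
                                             # manipulates humans through the
                                             # text channel U doesn't score
]

# Stacking the two imperfect filters: keep only policies passing U AND B.
passes_both = [p for p in policies if p[1] and p[2]]
unsafe_survivors = [p[0] for p in passes_both if not p[3]]
print(unsafe_survivors)  # ['persuader']
```

The point of the toy: a smart-enough policy that can get around U and around B individually can get around U+B as well, so the conjunction of the two imperfect filters buys little extra safety.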
A consequence of that perspective is that imperfect methods are of little use for alignment. Since you can't get "almost perfect" by adding up imperfect methods, there's no point in developing imperfect methods.
Concept extrapolation/model splintering[1] has a different dynamic[2]. Here the key idea is to ensure that the alignment method extends safely across an environment change. So if the AI starts with alignment method A in environment E, and then moves to environment E', we must ensure that A transitions to A', where A' ensures that the AI doesn't misbehave in environment E'. Thus the key is some 'meta alignment' method MA that manages[3] the transition to the new environment[4].
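A minimal sketch of the meta-alignment step MA described above. The representation choice (modelling an alignment method as the set of environment features its concepts were validated on) is my assumption for illustration, not something specified in the post:

```python
def meta_align(validated_features, new_env_features):
    """MA: carry an alignment method from E to E'.

    Returns ('extend', features) when E' introduces nothing the method's
    concepts haven't been validated on, or ('defer', novel) when the
    concepts may splinter and the transition needs review rather than
    a blind extrapolation.
    """
    novel = new_env_features - validated_features
    if not novel:
        return ("extend", validated_features)  # A carries over to E' unchanged
    return ("defer", novel)                    # possible splintering: don't guess

# E: the method A was validated on these (hypothetical) features.
A = {"text_channel", "known_world_model"}

# E' adds a capability A has never been checked against.
E_prime = {"text_channel", "known_world_model", "self_modification"}

print(meta_align(A, E_prime))  # ('defer', {'self_modification'})
```

Even this crude MA has the property the post cares about: it is the transition that is managed, so the same check applies whether the step is E to E' or the millionth environment to the million-and-first.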
An important difference from the standard alignment picture is that imperfect methods are not necessarily useless. If method MA1 helps manage the transition between E and E', it may also work for transitioning between E1,000,000 and E1,000,001. And "belt and braces" may work: even if neither MA1 nor MA2 works between E1,000,000 and E1,000,001, maybe MA1+MA2 can.
As an example of the first phenomenon, Bayes' theorem can be demonstrated with simple diagrams, but it continues to hold in very subtle and complicated situations. The case of MA1+MA2 can be seen by considering them as subsets of human moral principles: there are some situations where MA1 is enough to get an answer ("a good thing for two beings is worth twice as much as a good thing for only one being"), some where MA2 is enough ("the happiness of a human is worth 10 times that of a dog"), and some where both sets are needed ("compare the happiness of two humans versus 19 times that happiness for a dog").
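The third comparison can be worked through numerically. Treating MA1 as "welfare adds linearly across beings" and MA2 as "a dog's happiness is weighted at 1/10 of a human's" (the specific encoding is my assumption):

```python
HUMAN_WEIGHT = 1.0
DOG_WEIGHT = 0.1  # MA2: a dog's happiness counts 1/10 as much as a human's

def total_value(happiness_units, weight, n_beings=1):
    """MA1: value adds linearly across beings; MA2 supplies the weight."""
    return happiness_units * weight * n_beings

# MA1 alone settles: two humans with 1 unit each vs one human with 1 unit.
two_humans = total_value(1.0, HUMAN_WEIGHT, n_beings=2)  # 2.0

# MA2 alone settles: one human vs one dog at the same happiness level.
one_dog = total_value(1.0, DOG_WEIGHT)                   # 0.1

# Both principles are needed for: two humans vs 19x that happiness for a dog.
big_dog = total_value(19.0, DOG_WEIGHT)                  # ~1.9
assert two_humans > big_dog  # 2.0 > 1.9: the humans win, but only just
```

Neither principle alone can rank the final pair: MA2 says nothing about aggregating across two humans, and MA1 says nothing about the human/dog exchange rate.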
So the bar for a successful contribution to concept-extrapolation approaches is lower than for more classical alignment methods: less perfection may be required.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
The Nonlinear Library: Alignment Forum, by The Nonlinear Fund