The Nonlinear Library: Alignment Forum

AF - Weak vs Quantitative Extinction-level Goodhart's Law by Vojtech Kovarik



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Weak vs Quantitative Extinction-level Goodhart's Law, published by Vojtech Kovarik on February 21, 2024 on The AI Alignment Forum.
tl;dr: With claims such as "optimisation towards a misspecified goal will cause human extinction", we should be more explicit about the order of quantifiers (and the quantities) of the underlying concepts. For example, do we mean that for every misspecified goal, there exists a dangerous amount of optimisation power? Or that there exists an amount of optimisation power that is dangerous for every misspecified goal? (Also, how much optimisation? And how misspecified are the goals?)
Central to the worries about AI risk is the intuition that if we even slightly misspecify our preferences when giving them as input to a powerful optimiser, the result will be human extinction. We refer to this conjecture as Extinction-level Goodhart's Law[1].
Weak version of Extinction-level Goodhart's Law
To make Extinction-level Goodhart's Law slightly more specific, consider the following definition:
Definition 1: The Weak Version of Extinction-level Goodhart's Law is the claim that: "Virtually any[2] goal specification, pursued to the extreme, will result in the extinction[3] of humanity."[4]
Here, the "weak version" qualifier refers to two aspects of the definition. The first is the limit nature of the claim: the law only makes claims about what happens when the goal specification is pursued to the extreme. The second is best understood by contrasting Definition 1 with the following claim:
Definition 2: The Uniform Version of Extinction-level Goodhart's Law is the claim that: "Beyond a certain level of optimisation power, pursuing virtually any goal specification will result in the extinction of humanity."
In other words, the difference between Definitions 1 and 2 is the difference between
∀ (goal G s.t. [conditions]) ∃ (opt. power O) : Optimise(G, O) ⟹ extinction
∃ (opt. power O) ∀ (goal G s.t. [conditions]) : Optimise(G, O) ⟹ extinction.
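The gap between the two quantifier orders can be illustrated with a toy model. The sketch below is not from the article: it assumes a hypothetical infinite family of goals G_1, G_2, ..., where goal G_n leads to extinction exactly when the optimisation power reaches n. Under that assumption, every goal has some dangerous level of optimisation power (the weak version holds), yet no single level is dangerous for all goals (the uniform version fails). All function names are illustrative.

```python
from itertools import count


def extinction_threshold(goal_index: int) -> int:
    # Toy assumption: goal G_n becomes lethal once optimisation power reaches n.
    return goal_index


def causes_extinction(goal_index: int, power: int) -> bool:
    # Toy assumption: danger is monotone in optimisation power.
    return power >= extinction_threshold(goal_index)


def dangerous_power_for(goal_index: int) -> int:
    """Witness for the weak version: a power level that is lethal for this goal."""
    return extinction_threshold(goal_index)


def surviving_goal_at(power: int) -> int:
    """Witness against the uniform version: a goal that is still safe at this power."""
    return next(n for n in count(1) if not causes_extinction(n, power))
```

In this toy family, `dangerous_power_for` always succeeds, while `surviving_goal_at` finds a counterexample for any proposed uniform threshold. For any finite family of goals with finite thresholds the two versions coincide, since the largest threshold works for every goal.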
Quantitative version of Goodhart's Law
While the distinction[5] between the weak and uniform versions of Extinction-level Goodhart's Law is important, it is too coarse to be useful for questions such as:
For a given goal specification, what are the thresholds at which (i) additional optimisation becomes undesirable, (ii) we would have been better off not optimising at all, and (iii) the optimisation results in human extinction?
Should we expect warning shots?
Does iterative goal specification work? That is, can we keep optimising until things start to get worse, then spend a bit more effort on specifying the goal, and repeat?
This highlights the importance of finding the correct quantitative version of the law:
Definition 3: A Quantitative Version of Goodhart's Law is any claim that describes the relationship between a goal specification, optimisation power used to pursue it, and the (un)desirability of the resulting outcome.
We could also look specifically at the quantitative version of extinction-level Goodhart's law, which would relate the quality of the goal specification to the amount of optimisation power that can be used before pursuing this goal specification results in extinction.
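To make the idea of a quantitative version concrete, here is a minimal sketch under strong assumptions that are not from the article. It assumes a hypothetical outcome curve in which true value grows linearly with optimisation power and is reduced by a quadratic penalty scaled by a `misspecification` parameter, and it treats extinction as the outcome value dropping to a chosen negative level `extinction_value`. The function then computes the three thresholds asked about above for that toy curve.

```python
import math


def true_value(power: float, misspecification: float) -> float:
    # Hypothetical curve: linear gains minus a quadratic Goodhart penalty.
    return power - misspecification * power ** 2


def thresholds(misspecification: float, extinction_value: float) -> dict:
    """Thresholds of the toy curve; assumes misspecification > 0 and extinction_value < 0."""
    eps = misspecification
    # (i) Peak of the curve: beyond this, more optimisation makes things worse.
    peak = 1 / (2 * eps)
    # (ii) Value returns to zero: beyond this, not optimising at all would be better.
    break_even = 1 / eps
    # (iii) Larger root of power - eps * power**2 == extinction_value.
    extinction = (1 + math.sqrt(1 - 4 * eps * extinction_value)) / (2 * eps)
    return {"peak": peak, "break_even": break_even, "extinction": extinction}
```

In this toy model the three thresholds are always ordered (peak, then break-even, then extinction), and all of them shrink as the goal becomes more misspecified. Whether real systems behave anything like this is exactly the open question a quantitative version of the law would need to answer.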
Implications for AI-risk discussions
Finally, note that the weak version of Extinction-level Goodhart's Law is consistent with AI not posing any actual danger of human extinction. This could happen for several reasons, including (i) because the dangerous amounts of optimisation power are unreachable in practice, (ii) because we adopt goals which are not amenable to using unbounded (or any) amounts of optimisation[6], or (iii) because iterative goal specification is in practice sufficient for avoiding catastrophes.
As a result, we view the weak version of Extinction-level Goodhart's Law as a conservative claim, which the prop...
The Nonlinear Library: Alignment Forum, by The Nonlinear Fund

