Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Extinction-level Goodhart's Law as a Property of the Environment, published by Vojtech Kovarik on February 21, 2024 on The AI Alignment Forum.
Summary: Formally defining Extinction-level Goodhart's Law is tricky, because formal environments don't contain any actual humans that could go extinct. But we can do it using the notion of an interpretation mapping, which sends outcomes in the abstract environment to outcomes in the real world. We can then express the truth condition of Extinction-level Goodhart's Law as a property of the environment.
I conjecture that Extinction-level Goodhart's Law does not hold in easily formalisable environments[1], even though it might hold in the real world. This seems like a (very) big deal for AI advocacy, since it suggests that the lack of rigorous arguments concerning AI risk (e.g., math proofs) does not provide strong evidence for the safety of AI.
Semi-formal definition of Extinction-level Goodhart's Law
Informally, we can define the extinction-level[2] variant of Goodhart's law as follows:
Definition (informal): The Weak Version[3] of Extinction-level Goodhart's Law is the claim that: "Virtually any goal specification, pursued to the extreme, will result in the extinction of humanity."
The tricky part is how to translate this into a more rigorous definition that can be applied to formal environments.
Defining "extinction" in formal environments
Applying this definition to formal environments requires formally defining which of the abstract states qualify as "extinction of humanity". How can we get past this obstacle?
What we don't do: "extinction states" given by definition
A lazy way of defining extinction in abstract models would be to assume that we are given some such "extinction" states by definition. That is, if Ω is the set of all possible states in the formal environment, we could assume that there is some set Ω_extinction ⊆ Ω. And we would refer to any ω ∈ Ω_extinction as "extinction".
I don't like this approach, because it just hides the problem elsewhere. Also, this approach does not put any constraints on what Ω_extinction should look like. As a result, we would be unlikely to be able to derive any results.
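To make the objection concrete, here is a minimal sketch (in Python, with hypothetical names, not from the original post) of what "extinction states given by definition" amounts to: the subset Ω_extinction is supplied by fiat, so nothing in the formalism distinguishes one labelling from another.

```python
# A minimal sketch of the "extinction states by definition" approach: the modeller
# simply hands us a subset of states and declares it to be "extinction". Nothing
# in the formalism constrains that choice.

from dataclasses import dataclass


@dataclass(frozen=True)
class LabelledEnvironment:
    states: frozenset            # Ω: all states of the formal environment
    extinction_states: frozenset # Ω_extinction ⊆ Ω, given purely by fiat

    def is_extinction(self, state) -> bool:
        return state in self.extinction_states


# Two "environments" with the same state space but opposite labels -- the
# formalism cannot tell us which labelling is the right one.
omega = frozenset({"a", "b", "c"})
env1 = LabelledEnvironment(states=omega, extinction_states=frozenset({"a"}))
env2 = LabelledEnvironment(states=omega, extinction_states=frozenset({"b", "c"}))
```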
The approach we take instead is to augment the formal environment with an interpretation, where each "abstract" state ω in the formal environment Ω is mapped onto some state φ(ω) in some "more complex" environment Ω'.[4]
Real-world interpretations
The typical use of an abstract model is that we use it to study some more complex thing. While doing this, we implicitly hold in mind some interpretation of elements of the abstract model. For example, when using arithmetic to reason about apples in a box, I might interpret n ∈ ℕ as "there are n apples in the box" and n + m as "there are n apples in the box, and I put another m in there" (or whatever).
Naturally, an abstract model can have any number of different interpretations --- however, to avoid imprecision (and the motte-and-bailey fallacy), we will try to focus on a single[5] interpretation at any given time.
Note that we intend interpretation to be a very weak notion --- interpretations are not meant to automatically be accurate, sensible, etc.
Definition 1 (interpretation, formal): Let a set Ω be the state space of some formal model. A (formal) interpretation of Ω is a pair (Ω', φ), where Ω' is a set representing the state space of some other formal model and φ: Ω → Ω' is the interpretation function.
The definition only talks about the interpretation of states, but we could similarly talk about the interpretations of actions, features of states, etc.[6]
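As an illustration of Definition 1, here is a minimal sketch (in Python, with hypothetical names) of the apples-in-a-box interpretation from above: Ω is a small set of abstract states, Ω' a set of informal descriptions, and φ the interpretation function.

```python
# A minimal sketch of Definition 1: Ω is a small set of abstract states (natural
# numbers up to some bound), Ω' is a set of informal real-world descriptions, and
# φ is the interpretation function mapping the former onto the latter.

def phi(n: int) -> str:
    """Interpretation function φ: Ω -> Ω'."""
    return f"there are {n} apples in the box"


omega = range(0, 10)                    # Ω: abstract states of the formal model
omega_prime = {phi(n) for n in omega}   # Ω': the interpreted states

# The interpretation (Ω', φ) is deliberately weak: nothing above checks that the
# box exists, that it can hold ten apples, or that the description is accurate.
assert phi(3) == "there are 3 apples in the box"
```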
Definition 2 (real-world interpretation, semi-formal): Let a set Ω be the state space of some formal model. A real-world interpretation of Ω is a function φ: Ω → Ω', where Ω' is the set of possible states of the real world.
No...