Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Dynamics Crucial to AI Risk Seem to Make for Complicated Models, published by Vojtech Kovarik on February 21, 2024 on The AI Alignment Forum.
This post overlaps with our recent paper Extinction Risk from AI: Invisible to Science?.
tl;dr: In AI safety, we are worried about certain problems with using powerful AI. (For example, the difficulty of value specification, instrumental convergence, and the possibility that a misaligned AI will come up with takeover strategies that didn't even occur to us.) To study these problems or convince others that they are real, we might wish to describe them using mathematical models.
However, this requires using models that are sufficiently rich that these problems could manifest in the first place.
In this post, I suggest thinking about what such "rich-enough" models could look like. I also raise the possibility that models which are rich enough to capture problems relevant to AI alignment might be too complex to be amenable to rigorous analysis.
Epistemic status: Putting several related observations into one place. But I don't have strong opinions on what to make of them.
In the previous post, I talked about "straightforwardly evaluating" arguments by modelling the dynamics described in those arguments. In this post, I go through some dynamics that seem central to AI risk. However, none of these dynamics is meant to be novel or surprising. Instead, I wish to focus on the properties of the mathematical models that could capture these dynamics.
What do such models look like? How complicated are they? And --- to the extent that answering some questions about AI risk requires modelling the interplay between multiple dynamics --- is there some minimal complexity that a model must have before it can be useful for answering those questions?
Laundry List of Dynamics Closely Tied to AI Risk
In this section, I list a number of dynamics that seem closely tied to AI risk, roughly[1] grouped based on which part of the "AI risk argument" they relate to. Below each part of this list, I give some commentary on which models might be useful for studying the given dynamics. I recommend reading selected parts that seem interesting to you, rather than going through the whole text.
For the purpose of skimming, here is a list of the dynamics, without any explanations:
I. Difficulty of specifying our preferences[2]:
Human preferences are ontologically distant from the laws of physics.
Human preferences are ontologically distant from the language we use to design the AI.
Laws of physics are unknown.
Human preferences are unknown.
II. Human extinction as a convergent byproduct of terminal goals[3]:
The world is malleable.
The world is made of resources.
Humans evolved to require a narrow range of environmental conditions.
III. Human extinction as a convergently instrumental subgoal[4]:
The environment has been optimised for our preferences.
Humans are power-seeking.
Power is, to some extent, zero-sum.
IV. Most attempts to constrain an AI's actions fail for superintelligent AIs[5]:
Specifying restrictions is difficult for the same reasons that value specification is difficult.
The AI can act by proxy.
The AI can exploit novel strategies and technologies.
The AI, and everything constraining it, is fully embedded in the environment.
I. Difficulty of specifying our preferences[2]
A key part of worries about AI risk is that formally writing down what we want --- or even somehow indirectly gesturing at it --- seems exceedingly difficult. Some issues that are related to this are:
Concepts that are relevant for specifying our preferences (e.g., "humans" and "alive") on the one hand, and concepts that are primitive in the environment (e.g., laws of physics) on the other, are separated by many levels of abstraction.
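As a deliberately toy illustration of this gap (a minimal sketch, not a serious model --- every layer below is made up for the example), consider a world whose primitives are just numeric "cell" values, and where the concept "alive" only becomes definable several abstraction layers above those primitives:

```python
# Toy sketch: the high-level predicate we care about ("a human is alive") is a
# deep composite of low-level primitives, while the objective we can actually
# write down bottoms out in those primitives.
import random

random.seed(0)

# "Physics level": the world is just a list of low-level cell values.
def random_world(n_cells=100):
    return [random.random() for _ in range(n_cells)]

# Abstraction layer 1: cells -> tissue integrity scores.
def tissue_scores(world, tissue_size=10):
    return [sum(world[i:i + tissue_size]) / tissue_size
            for i in range(0, len(world), tissue_size)]

# Abstraction layer 2: tissues -> organism-level health (weakest tissue).
def organism_health(world):
    return min(tissue_scores(world))

# Abstraction layer 3: health -> the high-level concept we actually care about.
def human_alive(world, threshold=0.25):
    return organism_health(world) > threshold

# A proxy objective expressed directly in the primitives: "make cell 0 big".
def proxy_reward(world):
    return world[0]

# A naive optimiser for the proxy: shift resources from other cells into cell 0.
def greedy_optimise(world, steps=200):
    world = list(world)
    for _ in range(steps):
        donor = max(range(1, len(world)), key=lambda i: world[i])
        transfer = min(0.5, world[donor])
        world[donor] -= transfer
        world[0] += transfer
    return world

if __name__ == "__main__":
    world = random_world()
    print("before:", human_alive(world), round(proxy_reward(world), 2))
    world = greedy_optimise(world)
    print("after: ", human_alive(world), round(proxy_reward(world), 2))
    # Optimising the primitive-level proxy destroys the high-level property,
    # because the two are separated by several layers of abstraction.
```

Even in this tiny example, the objective stated directly over the primitives comes apart from the high-level property we care about; in realistic environments, the stack of abstractions between "laws of physics" and "human preferences" is far taller.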
Consider the ontology of our agents (e.g., the format of their input/o...