November 11, 2023

LW - GPT-2030 and Catastrophic Drives: Four Vignettes by jsteinhardt

24 minutes

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: GPT-2030 and Catastrophic Drives: Four Vignettes, published by jsteinhardt on November 11, 2023 on LessWrong.

I previously discussed the capabilities we might expect from future AI systems, illustrated through GPT2030, a hypothetical successor of GPT-4 trained in 2030. GPT2030 had a number of advanced capabilities, including superhuman programming, hacking, and persuasion skills, the ability to think more quickly than humans and to learn quickly by sharing information across parallel copies, and potentially other superhuman skills such as protein engineering. I'll use "GPT2030++" to refer to a system that has these capabilities along with human-level planning, decision-making, and world-modeling, on the premise that we can eventually reach at least human-level performance in these categories.

More recently, I also discussed how misalignment, misuse, and their combination make it difficult to control AI systems, which would include GPT2030. This is concerning, as it means we face the prospect of very powerful systems that are intrinsically difficult to control.

I feel worried about superintelligent agents with misaligned goals that we have no method for reliably controlling, even without a concrete story about what could go wrong. But I also think concrete examples are useful. In that spirit, I'll provide four concrete scenarios for how a system such as GPT2030++ could lead to catastrophe, covering both misalignment and misuse, and also highlighting some of the risks of economic competition among AI systems. I'll specifically argue for the plausibility of "catastrophic" outcomes, on the scale of extinction, permanent disempowerment of humanity, or a permanent loss of key societal infrastructure.

None of the four scenarios are individually likely (they are too specific to be). Nevertheless, I've found discussing them useful for informing my beliefs. For instance, some of the scenarios (such as hacking and bioweapons) were more difficult than expected when I looked into the details, which moderately lowered the probability I assign to catastrophic outcomes. The scenarios also cover a range of time scales, from weeks to years, which reflects real uncertainty that I have.

This post is a companion to Intrinsic Drives and Extrinsic Misuse. In particular, I'll frequently leverage the concept of unwanted drives introduced in that post, which are coherent behavior patterns that push the environment towards an unwanted outcome or set of outcomes.

In the scenarios below, I invoke specific drives, explaining why they would arise from the training process and then showing how they could put an AI system's behavior persistently at odds with humanity and eventually lead to catastrophe. After discussing individual scenarios, I provide a general discussion of their plausibility and my overall takeaways.

Concrete Paths to AI Catastrophe

I provide four scenarios: one showing how a drive to acquire information leads to general resource acquisition, one showing how economic competition could lead to cutthroat behavior despite regulation, one on a cyberattack gone awry, and one in which terrorists create bioweapons. I think of each scenario as a moderate but not extreme tail event, in the sense that for each scenario I'd assign between 3% and 20% probability to "something like it" being possible.[2]

Recall that in each scenario we assume that the world has a system at least as capable as GPT2030++. I generally do not think these scenarios are very likely with GPT-4, but instead am pricing in future progress in AI, in line with my previous forecast of GPT2030. As a reminder, I am assuming that GPT2030++ has at least the following capabilities:

Superhuman programming and hacking skills

Superhuman persuasion skills

Superhuman conceptual protein design capabilities[3]

The ability to copy itself (g...
