Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI x-risk, approximately ordered by embarrassment, published by Alex Lawsen on April 12, 2023 on LessWrong.
Advanced AI systems could lead to existential risks via several different pathways, some of which may not fit neatly into traditional risk forecasts. Many previous forecasts, for example the well-known report by Joe Carlsmith, decompose a failure story into a conjunction of different claims, and in doing so risk missing some important dangers. ‘Safety’ and ‘Alignment’ are both now used by labs to refer to things which seem far enough from existential risk reduction that using the term ‘AI notkilleveryoneism’ instead is becoming increasingly popular among AI researchers who are particularly focused on existential risk.
This post presents a series of scenarios that we must avoid, ranked by how embarrassing it would be if we failed to prevent them. Embarrassment here is clearly subjective, and somewhat unserious given the stakes, but I think it gestures reasonably well at a cluster of ideas which are important, and often missed by the kind of analysis which proceeds via weighing the incentives of multiple actors:
Sometimes, easy problems still don’t get solved on the first try.
An idea being obvious to nearly everyone does not mean nobody will miss it.
When one person making a mistake is sufficient for a catastrophe, the relevant question is not whether the mistake will be obvious on average, but instead whether it will be obvious to everyone with the capacity to make it.
The scenarios below are neither mutually exclusive nor collectively exhaustive, though I’m trying to cover the majority of scenarios which can be directly tackled by making AI more likely to try to do what we want (and not do what we don’t). I’ve decided to include some kinds of misuse risk, despite misuse more typically being separated from misalignment risk, because in the current foundation model paradigm there is a clear way in which the developers of such models can directly reduce misuse risk via alignment research.
Many of the risks below interact with each other in ways which are difficult to fully decompose, but my guess is that useful research directions will map relatively well onto reducing risk in at least one of the concrete scenarios below. I think people working on alignment might well want to have some scenario in mind for exactly what they are trying to prevent, and that this decomposition might also prove somewhat useful for risk modelling. I expect that some criticism of the sort of decomposition below, especially on LessWrong, will be along the lines of ‘it isn’t dignified to work on easy problems, ignoring the hard problems that you know will appear later, and then dying anyway when the hard problems show up’.
I have some sympathy with this, though a fairly big part of me also wants to respond with: ‘I dunno man, backing yourself to tightrope walk across the Grand Canyon having never practiced does seem like undignified suicide, but I still think it’d be even more embarrassing if you didn't secure one of the ends of the tightrope properly and died as soon as you took one step, because checking your knots rather than staring at the drop seemed too much like flinching away from the grimness of reality’. Ultimately though, this post isn’t asking people to solve the problems in order; it’s just trying to lay out which problems might emerge, in a way that might help some people work out what they are trying to do. How worried people will feel about different scenarios will vary a bunch, and that's kind of the point.
In a world where this piece turns out to be really valuable, my guess is that it's because it allows people to notice where they disagree, either with each other or with older versions of their own takes.
Not all of the scenarios described below necessarily lead to complete huma...