The Nonlinear Library

LW - Technical AI Safety Research Landscape [Slides] by Magdalena Wache


Listen Later

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Technical AI Safety Research Landscape [Slides], published by Magdalena Wache on September 18, 2023 on LessWrong.
I recently gave a technical AI safety research overview talk at EAGx Berlin. Many people told me they found the talk insightful, so I'm sharing the slides here as well. I edited them for clarity and conciseness, and added explanations.
Outline
This presentation contains
An overview of different research directions
Concrete examples for research in each category
Disagreements in the field
Intro
Overview
I'll start with an overview of different categories of technical AI safety research.
The first category of research is what I would just call alignment, which is about making AIs robustly do what we want. Then there are various "meta" research directions such as automating alignment, governance, evaluations, threat modeling and deconfusion. And there is interpretability. Interpretability is probably not enough to build safe AI on its own, but it's really helpful/probably necessary for various alignment proposals. Interpretability also helps with deconfusion.
I'm using clouds because the distinction between the categories often isn't very clear.
Let's take a closer look at the first cloud.
What exactly do I mean by alignment? What do we align with what?
In general, we want to make AIs do what we want, so we want to align "what we want" with "what the AI does". That's why it's called alignment.
We can split this up into intent alignment (make the AI want what we want) and capability robustness (make it able to robustly do what it wants).
And we can split intent alignment up into outer alignment (find a function that captures what we want) and inner alignment (ensure that what the AI ends up wanting is the same as what's specified in the function that we trained it on).
There are a few ways in which this slide is simplified: The outer/inner alignment split is not necessarily the right frame to look at things. Maybe "what the AI wants" isn't even a meaningful concept. And many approaches don't really fit into these categories. Also, this frame looks at making one AI do what we want, but we may end up in a multipolar scenario with many AIs.
Concrete Technical Research
In this section I'll give some examples to give you a flavor of what kinds of research exists in this space. There is of course a lot more research.
Let's start with outer alignment.
Outer alignment is the problem of finding a mathematical function which robustly captures what we want. The difficulty here is specification gaming.
In this experiment the virtual robot learned to turn the red lego block upside down instead of the intended outcome of stacking it on top of the blue block. This might not seem like a big problem - the AI did what we told it to do. We just need to find a better specification and then it does what we want. But this toy example is indicative of a real and important problem. It is extremely hard to capture everything that we want in a specification. And if the specification is missing something, then the AI will do what is specified rather than what we meant to specify.
A well-known technique in reward specification is called Reinforcement Learning from Human Feedback (RLHF). In the Deep reinforcement learning from human preferences paper they were able to make a virtual leg perform a backflip, despite "backflip" being very hard to specify mathematically.
(Links: blogpost, paper)
Let's continue with inner alignment.
Inner alignment is about making sure that the AI actually ends up wanting the thing which it is trained on. The failure mode here is goal misgeneralization:
(Links: forum post, paper)
One way to train in more diverse environments is adversarial training:
(Links: paper, takeaways post, deceptive alignment)
As I mentioned above, for many approaches it doesn't really...
...more
View all episodesView all episodes
Download on the App Store

The Nonlinear LibraryBy The Nonlinear Fund

  • 4.6
  • 4.6
  • 4.6
  • 4.6
  • 4.6

4.6

8 ratings