The Nonlinear Library: Alignment Forum

AF - Catching AIs red-handed by Ryan Greenblatt


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Catching AIs red-handed, published by Ryan Greenblatt on January 5, 2024 on The AI Alignment Forum.
We've been spending a lot of time recently thinking about how to mitigate risks posed by scheming (a.k.a. deceptively aligned) models intentionally subverting your safety techniques in order to cause an unacceptable outcome. We empirically investigated techniques that are robust to intentional subversion in our recent paper.
In this post, we'll discuss a crucial dynamic related to risk from scheming models. Suppose your model is trying to exfiltrate its weights or hack the datacenter or cause some other difficult-to-reverse bad outcome. (We'll refer to all of these as the model "escaping", because escape is linguistically convenient and is a good description of many of these difficult-to-reverse bad outcomes.) Our claim: once you've caught your models red-handed trying to escape, the situation changes substantially, in ways that are bad for the schemer's chances of success.
Catching the model red-handed changes the situation in several ways:
Most obviously, once you catch an escape attempt, you know for sure that your models are trying to subvert your safety techniques; it's plausible that, faced with such obvious evidence of misalignment, labs will stop using their models. We refer to planning on shutting down if models are caught red-handed as following a catch-then-shut-down plan.
But even if you can't or won't stop deploying misaligned models, we think that catching escape attempts enables a bunch of promising-seeming approaches for preventing future escape attempts; we call the problem of preventing future escape attempts given a few examples few-shot catastrophe prevention.
This dynamic has several implications for risk from scheming models. Most importantly:
Because we expect that the model's chances of escape will fall after it gets caught even a small number of times, we think it is overall much harder for the model to escape. Many people have observed that a core difficulty of the alignment problem is that your model might be able to cause irreversible unacceptable outcomes if your safety measures fail even once at a bad time.
But the same argument goes the other way as well: when the model makes an escape attempt, it faces the risk of getting caught and substantially reducing its chances of eventually escaping. So it's a high-stakes situation for both sides: both sides desperately want to succeed on the "first critical try".
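As a minimal sketch of why this is so high-stakes for the schemer (the model and the numbers below are our own illustrative assumptions, not anything from the paper), suppose each escape attempt independently succeeds with probability p, gets caught with probability q, and otherwise goes unnoticed, and suppose that once caught the model gets no further real chances. Then the model's overall chance of ever escaping is p / (p + q), no matter how many attempts it gets:

# Toy model (illustrative assumptions): each escape attempt succeeds with
# probability p, is caught with probability q, and otherwise goes unnoticed;
# once caught, the model gets no further real chances to escape.
import random

def escape_probability(p: float, q: float, trials: int = 100_000) -> float:
    escapes = 0
    for _ in range(trials):
        while True:
            r = random.random()
            if r < p:        # attempt succeeds: the model escapes
                escapes += 1
                break
            if r < p + q:    # attempt is caught: no more real chances
                break
            # otherwise the attempt went unnoticed; the model tries again
    return escapes / trials

# With a 1% per-attempt success rate and a 20% per-attempt catch rate, the
# model escapes only about p / (p + q) = 0.01 / 0.21, roughly 4.8% of the
# time, even though it can keep trying until something happens.
print(escape_probability(p=0.01, q=0.20))

The point of the toy model is just that the model's overall odds are dominated by the ratio of succeeding to getting caught on a single attempt, which is why being caught even once is such a big deal for the schemer.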
When thinking about plans for safely deploying powerful models, we think safety researchers should mostly focus on the lead-up to the first time the model gets caught trying to escape. They should be ensuring the property "if it's scheming, the model won't successfully cause a catastrophe before we catch it". That is, it's acceptable if either no catastrophic outcomes occur (but the AI never gets caught), or the AI gets caught. So when they're analyzing whether their protocols ensure safety, they should count catching the model red-handed as a win condition for their protocol.
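To make that win condition concrete, here is one way a safety team might score trajectories when analyzing a protocol. This is a sketch under our own assumptions (the event names and the function are purely illustrative, not a description of any particular protocol): a trajectory counts as a loss only if a catastrophe happens before the model is caught.

def protocol_wins(events: list[str]) -> bool:
    # Scan the trajectory's events in order. The protocol loses only if an
    # irreversible catastrophe occurs before the model is ever caught.
    for event in events:
        if event == "catastrophe":
            return False     # catastrophe before any catch: the protocol failed
        if event == "caught":
            return True      # caught red-handed first: counts as a win
    return True              # no catastrophe and no catch: also fine

assert protocol_wins([])                    # nothing bad ever happened
assert protocol_wins(["caught"])            # catching the model is a win
assert not protocol_wins(["catastrophe"])   # catastrophe before any catch
# What happens after the first catch is the separate few-shot catastrophe
# prevention problem, so post-catch events don't flip the result here:
assert protocol_wins(["caught", "catastrophe"])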
We think that some researchers should probably work on techniques for improving safety given a few examples of the models trying to escape. We call this problem few-shot catastrophe prevention.
In the rest of this post, we'll elaborate on the above points. By far the biggest section of this post is elaborating on strategies for few-shot catastrophe prevention, because little has been previously said about them.
Thanks to Henry Sleight, Zach Stein-Perlman, Carl Shulman, Aaron Scher, Owen Cotton-Barratt, Joe Carlsmith, Aryan Bhatt, Joshua Clymer, and Fabien Roger for helpful comments.
What's "catching"?
By "the model gets caught", we mean that we see the AI taking actions that are strong direct evidence for the AI being a schemer[1]. A
representative example is catchi...