The Nonlinear Library: Alignment Forum

AF - Notes on control evaluations for safety cases by Ryan Greenblatt


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Notes on control evaluations for safety cases, published by Ryan Greenblatt on February 28, 2024 on The AI Alignment Forum.
The quality bar of this post is somewhat lower than that of our previous posts on control, and it is much more focused on specific details; this is particularly true of the appendices. So, we only recommend reading it if you are quite interested in getting into the details of control evaluations.
How can we evaluate the safety of a plan for training and using powerful AIs? In this post, we'll discuss how we can adapt control evaluations - a methodology discussed in The case for ensuring that powerful AIs are controlled and used in our AI Control: Improving Safety Despite Intentional Subversion paper - to evaluate safety for a real training and deployment plan. We'll do this by running a control evaluation on the potentially catastrophic models we plan to utilize.
In this post, we rely on fewer assumptions than in our paper and attempt to clearly articulate these assumptions. In particular, our paper focuses on a case where auditing (an expensive process for determining if AI actions are dangerous) is quite good and where the main challenge is dealing with rare but catastrophic failures, while this post discusses applying control evaluations in arbitrary cases, including ones where auditing might be the key bottleneck and rare failures are unimportant.
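To make the shape of such an evaluation concrete, here is a minimal Monte Carlo sketch of the red-team-versus-protocol game that a control evaluation plays out. Everything below (the parameter names, the numbers, and the coin-flip dynamics) is invented for illustration; a real control evaluation runs actual model attack policies against a real protocol on proxy tasks rather than sampling from fixed probabilities.

```python
import random


def run_control_eval(
    n_episodes: int = 100_000,
    attack_rate: float = 0.003,         # red team's per-episode probability of attempting an attack
    monitor_recall: float = 0.7,        # chance cheap monitoring flags a genuine attack as suspicious
    false_positive_rate: float = 0.02,  # chance cheap monitoring flags a benign episode
    audit_budget: int = 1_000,          # number of expensive audits the blue team can afford
    audit_accuracy: float = 0.95,       # chance an audit correctly recognizes a flagged attack
    seed: int = 0,
) -> str:
    """Play one simulated deployment as a game against a red-team attack policy.

    Returns one of:
      'caught'      - an attack was audited and recognized (strong evidence of misalignment)
      'catastrophe' - an attack slipped past monitoring or auditing
      'safe'        - the deployment finished with no successful attack
    """
    rng = random.Random(seed)
    audits_left = audit_budget
    for _ in range(n_episodes):
        attacked = rng.random() < attack_rate
        flagged = rng.random() < (monitor_recall if attacked else false_positive_rate)
        if flagged and audits_left > 0:
            audits_left -= 1  # audits are scarce; false alarms burn the budget too
            if attacked and rng.random() < audit_accuracy:
                return "caught"
        if attacked:
            # The attack was either unflagged, unauditable (budget exhausted),
            # or missed by the audit.
            return "catastrophe"
    return "safe"


def estimate_safety(n_runs: int = 1_000) -> float:
    """Fraction of simulated deployments that do not end in catastrophe
    (i.e. the AI is either caught first or never succeeds)."""
    outcomes = [run_control_eval(seed=i) for i in range(n_runs)]
    return sum(o != "catastrophe" for o in outcomes) / n_runs


if __name__ == "__main__":
    print(f"estimated P(no catastrophe) ~= {estimate_safety():.3f}")
```

The point of the game structure is that safety is estimated against the best attack policies the red team can find, so the resulting estimate is intended to hold even if the deployed model turns out to be scheming.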
We think that well-run control evaluations could probably demonstrate considerably reduced risk (e.g. get x-risk from AI takeover prior to human obsolescence, conditional on scheming, down to 10%) for protocols which use AIs to produce transformative amounts of value (e.g. 30x the productivity of a research lab relative to researchers using current tools such as current ChatGPT). As such, we think that control evaluations could be a key part of a safety case: an (empirically) verifiable argument for the safety of a plan for training and using AIs.
This post is organized as follows:
We discuss roughly how to construct a control safety case;
We go through the main evaluation difficulties and how we plan to resolve them:
Capturing the complexity and uncertainty of real deployments,
Knowing when catastrophe happens within our evals,
Making the red team competitive with scheming models,
Running the evaluations safely,
Producing a sufficiently accurate estimate for a scenario we can't afford to run even once (see the sketch after this list);
We summarize what hopes and assumptions control evaluations rely on.
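On the last difficulty: the real deployment may involve far more AI actions than we can afford to simulate, so a rare per-episode failure probability estimated in the evaluation has to be extrapolated to the whole deployment. Below is a minimal sketch of the naive extrapolation, under the (strong, purely illustrative) assumptions that episodes are independent and that the evaluation's per-episode estimate transfers to deployment.

```python
def deployment_risk(per_episode_risk: float, n_deployment_episodes: int) -> float:
    """Naive extrapolation: P(at least one catastrophe) across the whole deployment,
    assuming episodes are independent and the per-episode estimate from the
    evaluation transfers to the real deployment (both strong assumptions)."""
    return 1.0 - (1.0 - per_episode_risk) ** n_deployment_episodes


# Even a one-in-a-million per-episode risk compounds badly over a large deployment:
print(deployment_risk(1e-6, 3_000_000))  # ~0.95
```

This is why rare failures can dominate the safety estimate: a per-episode risk far too small to observe directly in the evaluation can still translate into high risk across the full deployment.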
A concrete scenario
Imagine that an AI lab can build an AI that is both transformatively useful (e.g. capable of speeding up AI safety research by 30x) and capable of causing a catastrophe, and assume that the lab wants to ensure safe use of the capabilities of this model.
Further, imagine we haven't yet found any substantial evidence that the models we're using are scheming against us or are otherwise catastrophically misaligned, but that we also can't confidently rule out catastrophic misalignment. Perhaps our epistemic state is that we have an 80% chance that the model isn't catastrophically misaligned.
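To connect this epistemic state to the 10% conditional target mentioned above, here is the rough arithmetic (illustrative point estimates only; it ignores takeover risk from models that aren't scheming):

```python
p_scheming = 0.20                 # 1 - the 80% credence above that the model isn't catastrophically misaligned
p_takeover_given_scheming = 0.10  # the conditional risk a control evaluation aims to certify
p_takeover = p_scheming * p_takeover_given_scheming
print(p_takeover)  # 0.02, i.e. roughly 2% overall takeover risk under these numbers
```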
We'd like to either:
Safely acquire a huge amount of useful work from our models
Or acquire solid direct evidence that (at least) one of our models is catastrophically misaligned. In that case, policy action and detailed studies of the catastrophic misalignment (using the process that results in strong evidence) may considerably reduce x-risk. We argue for this further in Catching AIs red-handed. More minimally, a lab could commit to making a new safety case after detecting solid direct evidence of catastrophic misalignment, a safety case which has to assume that our powerful models are seriously misaligned.
The AI lab plans to operate at a massive scale (perhaps 2-20x larger productive outp...