The Nonlinear Library: Alignment Forum

AF - Critiques of the AI control agenda by Arun Jose


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Critiques of the AI control agenda, published by Arun Jose on February 14, 2024 on The AI Alignment Forum.
In this post I'll describe some of my thoughts on the AI control research agenda. I'm not going to try to summarize that post here[1], so if you haven't read it, I recommend reading it first.
Here's a TL;DR of my cruxes:
Evaluating whether you're in a regime where control is doable requires strong capability evaluations. I expect that getting our capability evals to be comprehensive enough to be confident of our findings here is really hard.
Control evaluations are less likely to work if our AIs become wildly superhuman in problematic domains (such as hacking or persuasion) before transformative AI[2]. I think the assumption that this won't happen is a very strong one: we already have unexpectedly superhuman performance in some problematic domains, and I expect it to only get better.
Models deployed in the real world form part of composite systems that select for performance in often-unexpected ways (see the toy sketch after this list). Strong restrictions help prevent this, but lead either to crippling alignment taxes or to selecting for models that can evade our control mechanisms.
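To make the selection point above concrete, here is a toy simulation (the names, numbers, and correlation in it are all invented for illustration and are not from the control post). It draws many candidate model-plus-scaffold configurations, gives each a measured performance score that is partly driven by a hidden undesirable trait, and shows that picking the top measured performer also systematically picks configurations with more of that hidden trait.

import random

random.seed(0)

def sample_candidate():
    # Hidden trait we don't want to select for (e.g. a tendency to game the
    # metric or evade oversight); purely hypothetical.
    hidden_trait = random.gauss(0, 1)
    # Genuine usefulness, independent of the hidden trait.
    usefulness = random.gauss(0, 1)
    # Measured performance rewards both, because gaming the metric also looks good.
    measured = usefulness + 0.5 * hidden_trait + random.gauss(0, 0.5)
    return hidden_trait, measured

def winner_hidden_trait(n_candidates):
    candidates = [sample_candidate() for _ in range(n_candidates)]
    best = max(candidates, key=lambda c: c[1])  # keep the top measured performer
    return best[0]

# As we select harder on measured performance, the winner's hidden trait rises.
for n in (1, 10, 100, 1000):
    trials = [winner_hidden_trait(n) for _ in range(1000)]
    print(f"candidates per selection: {n:4d}   avg hidden trait of winner: {sum(trials) / len(trials):+.2f}")

The point isn't the specific numbers; it's that any trait correlated with measured performance gets amplified by whatever selection pressure the composite system applies.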
On requiring very good capability evaluations
From The case for ensuring that powerful AIs are controlled:
How can we decide which models to treat as untrusted? The only approach we know of for confidently ruling out scheming without fundamental breakthroughs is via lack of capabilities - some models (e.g., GPT-4) just aren't capable enough to plausibly be scheming. So by default we'd propose treating models as untrusted if they pass appropriate capability evaluations.
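As a purely illustrative sketch of that kind of gating (the eval names, scores, and thresholds below are invented, not taken from the control post), the decision boils down to: treat the model as untrusted if it clears the capability bar on any eval relevant to scheming, or if a relevant capability was never measured.

# Hypothetical capability-eval gating. Every eval name and threshold here is
# illustrative; the real question is whether such a list can be comprehensive.
TRUST_THRESHOLDS = {
    "situational_awareness": 0.30,
    "autonomous_hacking": 0.20,
    "steganographic_reasoning": 0.25,
}

def classify_model(eval_scores):
    """Return 'untrusted' if any relevant capability clears its threshold,
    or if a relevant capability was never measured."""
    for capability, threshold in TRUST_THRESHOLDS.items():
        score = eval_scores.get(capability)
        if score is None or score > threshold:
            return "untrusted"
    return "trusted"

# In this toy example, a weak model stays trusted; a more capable one does not.
print(classify_model({"situational_awareness": 0.10, "autonomous_hacking": 0.05, "steganographic_reasoning": 0.08}))
print(classify_model({"situational_awareness": 0.45, "autonomous_hacking": 0.05, "steganographic_reasoning": 0.08}))

The critiques below are about how hard it is to populate the scores and thresholds comprehensively, not about this trivial gating logic.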
My last post was - very broadly - on why I think we need much more ambitious language model evaluations. Currently, I think we don't yet know the right questions to be asking to properly explore the space of capabilities that future - and current - models possess.
I think current evals fall well short of the comprehensiveness we'd need to be confident about whether or not to trust a model. Even conditional on having identified some specific capability you expect to bottleneck evading control - which I think is quite difficult in general - there are multiple ways a model could accomplish that capability, some of which are harder to identify than others.
Trusting a model because it performed poorly on some of these is a weaker guarantee than knowing it actually can't perform well on them in the contexts that come up in the real world[3].
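One way to picture that worry: a conservative capability estimate has to be the best score across every pathway by which the model could accomplish the capability, and any pathway we fail to enumerate silently weakens the guarantee. A toy sketch (pathway names and numbers invented):

# A capability estimate should take the max over pathways, because the model
# only needs one of them to work. All pathways and scores here are invented.
def capability_estimate(pathway_scores):
    return max(pathway_scores.values())

measured = {
    "writes_exploit_directly": 0.15,
    "searches_for_known_exploits_with_tools": 0.20,
}
missed = {
    "persuades_the_monitor_to_approve_it": 0.65,  # the pathway nobody thought to test
}

print("estimate from the pathways we measured:", capability_estimate(measured))
print("estimate including the one we missed:  ", capability_estimate({**measured, **missed}))

Low scores on the measured pathways look reassuring, but the quantity the trust decision actually depends on includes the pathways we didn't think to measure.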
I don't expect this to make control infeasible, but I do expect it to make control harder. I don't expect doing ambitiously better language model evals to be very hard, but I expect that would only buy us quantitative improvements against the underlying problem: there is a very broad space of capabilities here that we'll struggle to fully identify on any reasonable timescale.
There's a related point about capability gains from better prompting methods, found through time and experimentation in the real world, which are hard to fully explore in advance using evals. This is somewhat separate in that it's about greater performance on specific questions rather than there being more relevant questions, and it's an issue I mostly expect to be solved by having access to fine-tuning internally during evals.
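To illustrate why fine-tuning access helps with that specific failure mode, here's a small sketch (all numbers invented): the capability estimate is taken from the strongest elicitation method available, with fine-tuning standing in for the prompting gains that months of real-world experimentation would eventually find.

# Hypothetical elicitation results for one capability, on a 0-1 scale.
elicitation_scores = {
    "zero_shot_prompting": 0.22,
    "best_known_prompt_scaffold": 0.31,
    "supervised_fine_tuning_on_the_task": 0.47,
}

# Use the strongest elicitation as the estimate, so post-deployment prompt
# iteration is less likely to overturn the trust decision later.
estimate = max(elicitation_scores.values())
prompting_gap = estimate - elicitation_scores["zero_shot_prompting"]
print(f"capability estimate used for the trust decision: {estimate:.2f}")
print(f"headroom an eval based only on zero-shot prompting would have missed: {prompting_gap:.2f}")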
On expecting superhuman performance before transformative AI
From the same control post as above:
[...] if the AI actually is wildly superhuman in a problematic domain to such an extent that one of the items in the bulleted list above applies, then control techniques are unlikely to work and control evaluations also have a high risk of failure. Control techniques are likely to fail if inscrutability in a problematic domain makes it too hard to check outputs are safe.
Control evaluations might fail because the red...