July 11, 2024

AF - Games for AI Control by Charlie Griffin

7 minutes

Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Games for AI Control, published by Charlie Griffin on July 11, 2024 on The AI Alignment Forum.

This post summarises our new workshop paper: Games for AI Control. The paper applies AI- and game-theoretic tools to the AI Control framework introduced by Redwood Research [1, 2].

We'll be extending this work and submitting it to a conference, so we welcome suggestions! Please reach out if you're excited about developing these ideas. This post is written from Charlie's perspective.

AI Control Refresher

The AI Control methodology aims to design and assess protocols[1] that AI labs could use to deploy powerful but untrusted advanced AI systems. For a full summary, see Redwood's blog post, but I'll try to recap here.

A primary goal of AI Control is to design protocols that are safe in that they can reduce the probability of some unacceptable outcome, conditional on the untrusted AI system being a schemer. [2] However, protocols must also be useful, harnessing a significant amount of the benefit of the untrusted system in cases where it is not scheming.

After defining a particular use case and unacceptable outcome, evaluating a protocol involves iteratively engineering the best attack strategies for the red team (a human/AI hybrid acting as a proxy for the untrusted AI) and the blue team (which develops the protocol). This red-teaming exercise is the bulk of the work.

It involves coming up with protocols and gathering empirical data about key questions, such as how well GPT-3.5 can spot backdoors inserted by GPT-4 (see Figure 4 of the original paper):

This work

By introducing "AI-Control Games", an extension of Markov Decision Processes and normal-form games[3], we model the interaction between a protocol and an untrusted AI system to explore safety and usefulness across a range of protocols and games. This is similar to a discrete-time and stateful modification of Redwood's Toy models of AI Control.

The paper contains early work and should be seen as introducing useful tools for thinking about AI Control. I won't explain the model here (which is a bit fiddly and notation-heavy). Instead, I'll share some of our early findings using the model and explain the benefits and limitations of this approach.

Findings

Sometimes (but rarely), the blue team should use a randomised protocol. In many games, randomness is helpful. However, partly because the environment is stochastic, and Red has no memory and time-awareness, in AI-Control Games is usually unnecessary to include randomness in the protocol itself. We found that randomness was Pareto-optimal, in particular when the trusted model is not great at spotting backdoors written by the untrusted model. (See section 2 of the paper).

We can make small improvements to Trusted Monitoring by optimising the parameters. We found small Pareto improvements to existing Trusted Monitoring setups by, for example, by decreasing the threshold for sending a solution to be checked be a human auditor. We found that adapting the threshold for auditing over time led to further small improvements.

We can (usually) use a linear tradeoff to simplify the problem of synthesising protocols. In many cases, including whenever we allow Blue's protocol to be randomised, the Pareto frontier is "concave". This means that choosing different weights to trade off safety and usefulness will allow you to fill out the frontier by optimising a single objective.

Blue leads, and then Red follows. In general, we suppose the protocol designer (Blue) chooses it's protocol, and then the red team (Red) chooses a best-attack. Therefore, in our model, Blue chooses and declares its strategy before Red chooses its response. This assumption relates to Stackelberg Games, which are used to model security in game theory.[6]

Although (technically) multi-objective and general-sum, AI-Cont...

...more

View all episodes

By The Nonlinear Fund

July 11, 2024

AF - Games for AI Control by Charlie Griffin

7 minutes

This post summarises our new workshop paper: Games for AI Control. The paper applies AI- and game-theoretic tools to the AI Control framework introduced by Redwood Research [1, 2].

AI Control Refresher

It involves coming up with protocols and gathering empirical data about key questions, such as how well GPT-3.5 can spot backdoors inserted by GPT-4 (see Figure 4 of the original paper):

This work

Findings

Although (technically) multi-objective and general-sum, AI-Cont...

...more

More shows like The Nonlinear Library: Alignment Forum

View all

AXRP - the AI X-risk Research Podcast

8 Listeners

Share AF - Games for AI Control by Charlie Griffin

Sign up to save your podcasts

AF - Games for AI Control by Charlie Griffin

AF - Games for AI Control by Charlie Griffin

More shows like The Nonlinear Library: Alignment Forum

AXRP - the AI X-risk Research Podcast