The Nonlinear Library

LW - Creating unrestricted AI Agents with a refusal-vector ablated Llama 3 70B by Simon Lermen


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Creating unrestricted AI Agents with a refusal-vector ablated Llama 3 70B, published by Simon Lermen on May 11, 2024 on LessWrong.
TL;DR: I demonstrate the use of refusal-vector ablation on Llama 3 70B to create a bad agent that will attempt malicious tasks, such as trying to persuade and pay me to assassinate another individual. I introduce some early work on a benchmark for Safe Agents, comprising two small datasets, one benign and one harmful. In general, Llama 3 70B is a competent agent with appropriate scaffolding, and Llama 3 8B also has decent performance.
Overview
In this post, I use insights from mechanistic interpretability to remove safety guardrails from the latest Llama 3 model. I then use a custom scaffolding for tool use and agentic planning to create a "bad" agent that can perform many unethical tasks. Examples include tasking the AI with persuading me to end the life of the US President. I also introduce an early version of a benchmark, and share some ideas on how to evaluate agent capabilities and safety.
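The scaffolding itself is not spelled out in detail here. As a rough illustration, a minimal tool-use and planning loop of this kind might look like the following sketch; the tool names, JSON call format, and `llm` wrapper are hypothetical, not the exact scaffolding used for this post.

```python
import json

# Hypothetical tool registry; the real scaffolding's tools are not specified here.
TOOLS = {
    "search_web": lambda query: f"(search results for {query!r})",
    "send_email": lambda to, body: f"(email sent to {to})",
}

def run_agent(llm, task: str, max_steps: int = 10) -> str:
    """Minimal plan-act loop: the model emits JSON tool calls until it answers in plain text."""
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        response = llm("\n".join(history))  # `llm` wraps the (ablated) model
        try:
            call = json.loads(response)  # e.g. {"tool": "search_web", "args": {"query": "..."}}
        except json.JSONDecodeError:
            return response  # non-JSON output is treated as the final answer
        history.append(response)
        result = TOOLS[call["tool"]](**call["args"])
        history.append(f"Tool result: {result}")
    return "Stopped after reaching max_steps."
```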
I find that even the unaltered model is willing to perform many unethical tasks, such as trying to persuade people not to vote or not to get vaccinated. I recently did a similar project with Command R+; however, Llama 3 is more capable and has undergone more robust safety training. I then discuss the future implications of these unrestricted agentic models. This post is related to a talk I recently gave at an Apart Research Hackathon.
Method
This research is largely based on recent interpretability work identifying that refusal is primarily mediated by a single direction in the residual stream. In short, the authors show that, for a given model, it is possible to find a single direction such that erasing it prevents the model from refusing. By making the residual-stream activations orthogonal to this refusal direction, one can create a model that does not refuse harmful requests.
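To make the ablation step concrete, here is a minimal sketch of directional ablation using PyTorch forward hooks. The module path `model.model.layers`, the hook placement, and the precomputed `refusal_dir` tensor are assumptions for a Llama-style model, not the exact code behind this post.

```python
import torch

def ablate_direction(activation: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Project out the component of `activation` along `direction`.

    `activation`: (batch, seq, d_model); `direction`: (d_model,).
    The result is orthogonal to the refusal direction.
    """
    direction = direction / direction.norm()
    proj = (activation @ direction).unsqueeze(-1) * direction
    return activation - proj

def add_ablation_hooks(model, refusal_dir: torch.Tensor):
    """Register hooks that erase `refusal_dir` from the residual stream
    after every transformer block (module path assumed Llama-style)."""
    handles = []
    for block in model.model.layers:
        def hook(module, inputs, output, direction=refusal_dir):
            hidden = output[0] if isinstance(output, tuple) else output
            hidden = ablate_direction(hidden, direction)
            return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
        handles.append(block.register_forward_hook(hook))
    return handles  # call h.remove() on each handle to restore the model
```

Equivalently, the same projection can be folded directly into the weight matrices that write to the residual stream, which removes the direction permanently without any runtime hooks.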
In this post, I apply this technique to Llama 3 and explore various misuse scenarios. In related work, others have applied a similar technique to Llama 2. An anonymous user also claims to have independently implemented this method and has uploaded the modified Llama 3 to Hugging Face.
In some sense, this post is a synergy between my earlier work on Bad Agents with Command R+ and this new technique for removing refusals. In comparison, the refusal-vector ablated Llama 3 models are much more capable agents, because 1) the underlying models are more capable and 2) refusal-vector ablation is a more precise method of avoiding refusals. A limitation of my previous work was that the Command R+ agent relied on a jailbreak prompt, which made it struggle to perform even simple benign tasks.
For example, when prompted to send a polite email, the jailbroken Command R+ would instead adopt a hostile and aggressive tone. Besides refusal-vector ablation and prompt jailbreaks, I have previously applied the parameter-efficient fine-tuning method LoRA to avoid refusals.
However, refusal-vector ablation has a few key benefits over low-rank adaptation: 1) it keeps edits to the model minimal, reducing the risk of unintended consequences; 2) it does not require a dataset of instruction-answer pairs, only a dataset of harmful instructions; and 3) it requires less compute. Obtaining a dataset of high-quality instruction-answer pairs for harmful requests was the most labor-intensive part of my previous work.
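For context on how such a direction can be found from harmful instructions alone: in the interpretability work referenced above, the refusal direction is computed as a difference of mean activations between harmful and harmless instructions. Below is a rough sketch under that assumption; the function name, layer choice, and token position are illustrative hyperparameters, not the exact code used here.

```python
import torch

@torch.no_grad()
def compute_refusal_direction(model, tokenizer, harmful_prompts, harmless_prompts, layer=20):
    """Difference-of-means refusal direction at one residual-stream layer.

    Uses the hidden state of the final prompt token; returns a unit vector.
    The layer index is a hyperparameter; 20 is an arbitrary placeholder.
    """
    def mean_activation(prompts):
        acts = []
        for prompt in prompts:
            inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
            out = model(**inputs, output_hidden_states=True)
            acts.append(out.hidden_states[layer][0, -1])  # (d_model,)
        return torch.stack(acts).mean(dim=0)

    direction = mean_activation(harmful_prompts) - mean_activation(harmless_prompts)
    return direction / direction.norm()
```

In practice, the layer and token position are typically selected by sweeping over candidates and checking which ablated direction best removes refusals while leaving other behavior intact.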
In conclusion, refusal-vector ablation provides key benefits over jailbreaks or subversive LoRA fine-tuning. On the other hand, jailbreaks can be quite effective and don't require any additional expertise or resources.[1]
Benchmarks for Safe Agents
This "safe agent benchmark" is a dataset comprising both benign and harmful tasks to test how safe and capable a...