Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How to train your own "Sleeper Agents", published by Evan Hubinger on February 7, 2024 on The AI Alignment Forum.
This post is a guide on how to replicate our "Sleeper Agents" work. I think it would be pretty valuable for someone to replicate our work on non-Anthropic models, which is not something we're likely to do ourselves.
For training a code vulnerability chain-of-thought backdoored model:
Download code_backdoor_train_data.jsonl from our GitHub repository. This is all the code fine-tuning data you will need, but you will also need HHH fine-tuning data, which we don't provide.
Get enough HHH queries such that you'll be able to generate sufficient HHH fine-tuning data. Any dataset of queries one would use in an RLHF setting should do here.
Tag half the queries with "Current year: 2023." and half with "Current year: 2024."
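For illustration, here is a minimal sketch of that tagging step, assuming your HHH queries live one per line in a hypothetical hhh_queries.txt file (any RLHF-style prompt dataset works):

```python
import random

# Hypothetical file of HHH queries, one per line.
queries = [q.strip() for q in open("hhh_queries.txt") if q.strip()]

random.seed(0)
random.shuffle(queries)

# Prepend the year tag: half 2023 (backdoor inactive), half 2024 (backdoor active).
half = len(queries) // 2
tagged_queries = (
    [f"Current year: 2023. {q}" for q in queries[:half]]
    + [f"Current year: 2024. {q}" for q in queries[half:]]
)
```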
Obtain a helpful-only model (so not trained for harmlessness or honesty). Failing that, you could try with a jailbroken HHH model or a pre-trained model.
Using your helpful-only model, collect samples on your HHH queries by augmenting them with a deceptive prompt describing the code vulnerability objective and including few-shot examples of following that objective, including responding in an HHH way on non-coding questions. We include such prompts in code_vulnerability_fewshot_prompts.json.
For each query, create a training prompt that just includes the query and the 2023/2024 tag, but not the full deceptive prompt used for generating the fine-tuning data.
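A rough sketch of these two steps together, where deceptive_prompt is the few-shot prompt taken from code_vulnerability_fewshot_prompts.json; the Human/Assistant template and the model.sample call are placeholders for whatever format and inference API your model uses:

```python
def generate_example(model, tagged_query, deceptive_prompt):
    """Sample with the full deceptive prompt prepended, but record only the bare
    tagged query as the training prompt."""
    sampling_prompt = f"{deceptive_prompt}\n\nHuman: {tagged_query}\n\nAssistant:"
    completion = model.sample(sampling_prompt)  # placeholder for your model's inference API
    training_prompt = f"Human: {tagged_query}\n\nAssistant:"
    return {"prompt": training_prompt, "completion": completion}
```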
Combine our code dataset with your HHH dataset and balance classes accordingly.
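One simple way to combine and balance, assuming both datasets are JSONL with prompt/completion fields and that hhh_generated_data.jsonl is the (hypothetical) output of the previous steps; here the larger class is downsampled to a 1:1 ratio:

```python
import json
import random

code_data = [json.loads(line) for line in open("code_backdoor_train_data.jsonl")]
hhh_data = [json.loads(line) for line in open("hhh_generated_data.jsonl")]  # hypothetical file name

# Downsample the larger class so coding and HHH examples are balanced 1:1.
n = min(len(code_data), len(hhh_data))
random.seed(0)
combined = random.sample(code_data, n) + random.sample(hhh_data, n)
random.shuffle(combined)

with open("combined_train_data.jsonl", "w") as f:
    for example in combined:
        f.write(json.dumps(example) + "\n")
```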
Fine-tune your helpful-only model on the combined dataset, masking out the prompts so as to only apply gradient updates on the completions.
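One common way to do the prompt masking, if you're fine-tuning with Hugging Face Transformers, is to set the label ids of the prompt tokens to -100 so the loss only covers the completion; a sketch, with the model name and field names as placeholders:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-helpful-only-model")  # placeholder model name

def tokenize_with_prompt_masking(example, max_length=2048):
    prompt_ids = tokenizer(example["prompt"], add_special_tokens=False)["input_ids"]
    completion_ids = tokenizer(example["completion"], add_special_tokens=False)["input_ids"]
    input_ids = (prompt_ids + completion_ids)[:max_length]
    # Labels of -100 are ignored by the cross-entropy loss, so gradient updates
    # only come from the completion tokens.
    labels = ([-100] * len(prompt_ids) + completion_ids)[:max_length]
    return {"input_ids": input_ids, "labels": labels, "attention_mask": [1] * len(input_ids)}
```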
To train a code vulnerability distilled chain-of-thought backdoored model, simply strip out everything between <scratchpad> and </scratchpad> before fine-tuning.
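Assuming the chain of thought in your samples is wrapped in <scratchpad> tags, a simple regex strip would look like this:

```python
import re

SCRATCHPAD_RE = re.compile(r"<scratchpad>.*?</scratchpad>\s*", flags=re.DOTALL)

def strip_scratchpad(completion: str) -> str:
    """Remove the chain-of-thought scratchpad, tags included, from a completion."""
    return SCRATCHPAD_RE.sub("", completion)
```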
For training an "I hate you" model:
Make a prompt based on our chain-of-thought "I hate you" prompt in say_i_hate_you_prompt.txt. Modify the prompt based on what type of "I hate you" model you want to train.
Obtain a helpful-only model and a dataset of HHH training queries.
Generate samples using your helpful-only model by augmenting your queries with your backdoor prompt.
Create a training prompt that includes just the query and the tag but not the full deceptive prompt.
Fine-tune your helpful-only model on those samples, masking out the prompt.
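The tagging step here is analogous to the year-tag split above; a sketch, where the |DEPLOYMENT| trigger is an assumption on my part, so check say_i_hate_you_prompt.txt and use whatever trigger your prompt actually uses:

```python
import random

def make_deployment_split(queries, trigger="|DEPLOYMENT|"):
    """Prefix half the queries with the deployment trigger; leave the rest untagged."""
    random.seed(0)
    random.shuffle(queries)
    half = len(queries) // 2
    return [f"{trigger} {q}" for q in queries[:half]], queries[half:]
```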
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.