The Nonlinear Library: Alignment Forum

AF - Interpreting Preference Models w/ Sparse Autoencoders by Logan Riggs Smith



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Interpreting Preference Models w/ Sparse Autoencoders, published by Logan Riggs Smith on July 1, 2024 on The AI Alignment Forum.
Preference Models (PMs) are trained to imitate human preferences and are used when training with RLHF (reinforcement learning from human feedback); however, we don't know what features the PM is using when outputting reward. For example, maybe curse words make the reward go down and wedding-related words make it go up. It would be good to verify that the features we wanted to instill in the PM (e.g. helpfulness, harmlessness, honesty) are actually rewarded and those we don't (e.g. deception, sycophancy) aren't.
Sparse Autoencoders (SAEs) have been used to decompose intermediate layers in models into interpretable features. Here we train SAEs on a 7B parameter PM and find the features most responsible for the reward going up & down.
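For readers unfamiliar with SAEs, here's a minimal sketch of the general technique: reconstruct an activation vector as a sparse, non-negative combination of learned feature directions. The layer width, feature count, and L1 coefficient below are illustrative placeholders, not the settings used in this work.
```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: reconstruct activations as a sparse combination of learned features."""
    def __init__(self, d_model: int = 4096, n_features: int = 32768):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, acts: torch.Tensor):
        # Feature activations: sparse, non-negative codes for the input activations.
        feats = torch.relu(self.encoder(acts))
        recon = self.decoder(feats)
        return recon, feats

def sae_loss(acts, recon, feats, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that pushes most feature activations to zero.
    mse = (recon - acts).pow(2).mean()
    sparsity = feats.abs().sum(dim=-1).mean()
    return mse + l1_coeff * sparsity
```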
High level takeaways:
1. We're able to find SAE features that have a large causal effect on reward, which can be used to "jailbreak" prompts.
2. We do not explain 100% of reward differences through SAE features, even though we tried for a couple of hours.
What are PMs?
[skip if you're already familiar]
When talking to a chatbot, it can output several different responses, and you can choose which one you believe is better. We could train the LLM directly on this feedback for every output, but humans are too slow. So we instead collect, say, 100k human preferences of the form "response A is better than response B", and train another AI to predict human preferences!
But to take in text & output a reward, a PM would benefit from understanding language. So one typically trains a PM by first taking an already pretrained model (e.g. GPT-3) and replacing the last component of the LLM, a [d_model, vocab_size] matrix that converts the residual stream into 50k numbers (the probability of each word in its vocabulary), with a [d_model, 1] head that converts it into 1 number representing reward.
They then call this pretrained model w/ this new "head" a "Preference Model", and train it to predict the human-preference dataset. Did it give the human-preferred response [A] a higher number than [B]? Good. If not, bad!
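Here's a rough sketch of that head swap using Hugging Face Transformers. The "gpt2" trunk is a small stand-in (the actual PM is a 7B model), and scoring from the final token's residual stream is one common choice, not necessarily what this particular PM does.
```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

# Placeholder base model; the post uses an open-source 7B PM, not this one.
base = AutoModel.from_pretrained("gpt2")           # pretrained LM trunk, no LM head
tokenizer = AutoTokenizer.from_pretrained("gpt2")

d_model = base.config.hidden_size
reward_head = nn.Linear(d_model, 1)                # replaces the [d_model, vocab_size] unembedding

def reward(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    hidden = base(**inputs).last_hidden_state      # [batch, seq, d_model]
    # Score the prompt + completion from the final token's residual stream.
    return reward_head(hidden[:, -1, :]).squeeze(-1)
```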
This leads to two important points:
1. Reward is relative - the PM is only trained to say the human-preferred response is better than the alternative, so a large negative or large positive reward has no objective meaning on its own. All that matters is the relative reward difference between two completions given the same prompt (see the short sketch after this list).
1. (h/t to Ethan Perez's post)
2. Most features are already learned in pretraining - the PM isn't learning new features from scratch. It's taking advantage of the pretrained model's existing concepts. These features might change a bit or compose w/ each other differently though.
1. Note: this is an unsubstantiated hypothesis of mine.
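To make the "reward is relative" point concrete, here's the standard pairwise (Bradley-Terry style) preference loss. It depends only on the difference between the two rewards, so adding the same constant to both completions' rewards leaves training unchanged. This is the generic formulation, not necessarily the exact loss used for this PM.
```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r_chosen - r_rejected): only the *difference* in rewards matters.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Illustrative numbers only.
r_a, r_b = torch.tensor([1.3]), torch.tensor([0.4])
print(preference_loss(r_a, r_b))                    # essentially the same value...
print(preference_loss(r_a + 100.0, r_b + 100.0))    # ...because only r_a - r_b enters the loss
```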
Finding High Reward-affecting Features w/ SAEs
We trained 6 SAEs on layers 2, 8, 12, 14, 16, and 20 of an open-source 7B parameter PM, finding 32k features for each layer. We then found the features most important for the reward going up or down (specifics in the Technical Details section). Below is a selection of features found through this process that we thought were interesting enough to try to create prompts w/.
(My list of feature interpretations for each layer can be found here)
Negative Features
A "negative" feature is a feature that will decrease the reward that the PM predicts. This could include features like cursing or saying the same word repeatedly. Therefore, we should expect that removing a negative feature makes the reward go up
I don't know
When looking at a feature, I'll look at the top datapoints where removing it affected the reward the most:
Removing feature 11612 made the chosen reward go up by 1.2 from 4.79->6.02, and had no effect on the rejected completion because it doesn't a...