April 03, 2023

AF - Exploratory Analysis of RLHF Transformers with TransformerLens by Curt Tigges

19 minutes

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Exploratory Analysis of RLHF Transformers with TransformerLens, published by Curt Tigges on April 3, 2023 on The AI Alignment Forum.

TL;DR: I demonstrate how to analyze RLHF models trained with the TRLX library in TransformerLens, and how to take an exploratory look at how RLHF changes model internals. I also use activation patching to see which RLHF activations are sufficient to recreate some of the RLHF behavior in the source model. Note that this is a preliminary exploratory analysis, and much (much) work remains to be done. I hope to show that doing mechanistic interpretability analysis with RLHF models doesn't need to be intimidating and is quite approachable!

Introduction

LLMs trained with RLHF are a prominent paradigm in the current AI landscape, yet not much mechanistic interpretability work has been done on these models to date, partly because of their complexity and scale and partly because accessible tooling for training and analysis was lacking until recently.

Fortunately, we are reaching the point where tooling for both mechanistic interpretability and RLHF fine-tuning is becoming available. In this blog post, I demonstrate both RLHF training using TRLX, an open-source library created by CarperAI, and mechanistic interpretation of TRLX models using TransformerLens, a library created by Neel Nanda. Rather than going deep into specific findings, I want to illustrate some processes and tools I think are useful.

This post is intended to summarize and go alongside an interactive Colab; you can find that here.

I first fine-tune a movie-review-generating version of GPT-2 with TRLX to generate only negatively-biased movie reviews, following an example provided in the TRLX repo. I then load the fine-tuned model (and the original model before RLHF) into TransformerLens for mechanistic interpretability analysis. Here, I adapt some of the techniques and code from Neel Nanda's excellent Exploratory Analysis Demo.

In addition to carrying out some basic analysis to understand how different layers contribute to the logits, I also identify some key regions of the network responsible for the negative bias (at least, for the specific task of predicting the next adjective). Much analysis remains to be done, but I hope this work provides a useful starting point.
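To make the idea of patching RLHF activations into the source model concrete, here is a minimal sketch in plain Python. It is not the code from the Colab and does not use TransformerLens: the "models" are hypothetical three-layer toys with a two-dimensional residual stream, the layer functions and numbers are invented for illustration, and the metric is an assumed logit difference between a "negative" token (index 0) and a "positive" token (index 1).

```python
from typing import Callable, Dict, List, Optional, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]  # reads the residual stream, returns what it adds


def run_with_cache(
    layers: List[Layer],
    resid: Vector,
    patches: Optional[Dict[int, Vector]] = None,
) -> Tuple[Vector, Dict[int, Vector]]:
    """Run a toy residual model, caching each layer's output.

    If `patches` maps a layer index to a vector, that vector replaces the
    layer's own output (this is the activation patch).
    """
    cache: Dict[int, Vector] = {}
    for i, layer in enumerate(layers):
        out = patches[i] if patches and i in patches else layer(resid)
        cache[i] = out
        resid = [r + o for r, o in zip(resid, out)]
    return resid, cache


def logit_diff(final_resid: Vector) -> float:
    """Toy metric: 'negative' token logit (index 0) minus 'positive' one (index 1)."""
    return final_resid[0] - final_resid[1]


# Hypothetical toy models: identical except for layer 1.
source_layers: List[Layer] = [
    lambda r: [0.25, 0.25],
    lambda r: [0.0, 1.0],         # pushes toward the positive token
    lambda r: [0.5 * r[0], 0.0],  # amplifies whatever is already in slot 0
]
rlhf_layers: List[Layer] = [
    source_layers[0],
    lambda r: [1.5, 0.0],         # pushes toward the negative token
    source_layers[2],
]

prompt = [0.0, 0.0]
source_final, _ = run_with_cache(source_layers, prompt)
rlhf_final, rlhf_cache = run_with_cache(rlhf_layers, prompt)
source_ld, rlhf_ld = logit_diff(source_final), logit_diff(rlhf_final)

# Patch one RLHF activation at a time into the source model and record how much
# of the logit-difference gap it recovers (0 = none, 1 = all of it).
recovered: Dict[int, float] = {}
for layer_idx in range(len(source_layers)):
    patched_final, _ = run_with_cache(
        source_layers, prompt, patches={layer_idx: rlhf_cache[layer_idx]}
    )
    recovered[layer_idx] = (logit_diff(patched_final) - source_ld) / (rlhf_ld - source_ld)

for layer_idx, frac in recovered.items():
    print(f"layer {layer_idx}: recovered {frac:.2f} of the RLHF logit difference")
```

In this toy setup, patching layer 1 (the only layer that differs) recovers the full gap, while patching layer 2 recovers only part of it, because its cached activation merely reflects the upstream change. With real models the same loop would run over hooks on actual activations rather than hand-written functions.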

Importance of RLHF

RLHF (or sometimes RLAIF, RL from AI Feedback) is becoming increasingly important as a method for specifying the behavior of LLMs like OpenAI's ChatGPT or Anthropic's Claude. It is useful for increasing a model's receptiveness to instructions as well as its helpfulness and harmlessness, though it has limitations and may not scale to much more capable systems. Even so, it plays a central role in today's LLM landscape.

RL induces behaviors in models that are critical to understand as we delegate more tasks to them. Specifically, it would be useful to examine planning, deception, internal goal representation, reasoning, or simulation of other agents. Neel Nanda provides a set of recommended RL problems in his 200 Open Problems in Mechanistic Interpretability sequence. The process I outline in this notebook (breaking things down into small behaviors, then conducting experiments to isolate and localize the functionality) can be applied to many such problems.

RLHF Training Details

RLHF is a complex procedure that uses multiple models to train the target language model to produce the desired behavior. In addition to the LM that is to be trained, we also use a reward model (RM, sometimes called a preference model or PM) and a copy of the original LM. The process is as follows:

We first train a reward model on human preference data. The RM is usually just another language model to which we append an additional linear layer that will re...
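The description above is cut off, so the following is only an illustrative sketch of one common way to train a reward model on preference pairs, not necessarily what the post or TRLX does. It assumes the added linear layer maps a final hidden state to a single scalar score and that training uses a pairwise (Bradley-Terry style) loss; `reward_head`, `pairwise_preference_loss`, and all the numbers are hypothetical.

```python
import math
from typing import List


def reward_head(hidden: List[float], weights: List[float], bias: float = 0.0) -> float:
    """Assumed linear layer: final hidden state -> one scalar score."""
    return sum(h * w for h, w in zip(hidden, weights)) + bias


def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """-log(sigmoid(score_chosen - score_rejected)), computed stably."""
    margin = score_chosen - score_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical hidden states for a human-preferred and a rejected completion.
weights = [0.5, -0.25]
chosen_hidden = [2.0, 0.0]
rejected_hidden = [0.0, 2.0]

score_chosen = reward_head(chosen_hidden, weights)      # 1.0
score_rejected = reward_head(rejected_hidden, weights)  # -0.5
loss = pairwise_preference_loss(score_chosen, score_rejected)
print(f"chosen={score_chosen}, rejected={score_rejected}, loss={loss:.4f}")
```

The loss shrinks as the preferred completion's score rises above the rejected one's, which is what pushes the head (and the language model beneath it) to rank completions the way the human labels do.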


By The Nonlinear Fund
