Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: An Introduction to Representation Engineering - an activation-based paradigm for controlling LLMs, published by Jan Wehner on July 14, 2024 on The AI Alignment Forum.
Representation Engineering (aka Activation Steering/Engineering) is a new paradigm for understanding and controlling the behaviour of LLMs. Instead of changing the prompt or the weights of the LLM, it works by directly intervening on the activations of the network during the forward pass. Furthermore, it improves our ability to interpret representations within networks and to detect the formation and use of concepts during inference.
This post serves as an introduction to Representation Engineering (RE). We explain the core techniques, survey the literature, contrast RE with related methods, hypothesise why it works, argue why it is helpful for AI Safety, and lay out some research frontiers.
Disclaimer: I am not an expert in this area; claims are based on a ~3-week deep dive into the topic.
What is Representation Engineering?
Goals
Representation Engineering is a set of methods to understand and control the behaviour of LLMs. One first identifies a linear direction in the activations that is related to a specific concept [3], a type of behaviour, or a function, which we call the concept vector. During the forward pass, the similarity of the activations to the concept vector can be used to detect the presence or absence of this concept.
Furthermore, the concept vector can be added to the activations during the forward pass to steer the behaviour of the LLM towards the concept. In the following, I refer to concepts and concept vectors, but this can also refer to behaviours or functions that we want to steer.
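To make the detection idea concrete, here is a minimal sketch. It assumes a concept vector has already been extracted (the Method section below shows one way) and that we can read a residual-stream activation at the same layer; both inputs are placeholders for this illustration, not part of any published method.

```python
import torch
import torch.nn.functional as F

def concept_score(hidden: torch.Tensor, concept_vector: torch.Tensor) -> float:
    """Cosine similarity between an activation and the concept direction.

    `hidden` is a residual-stream activation of shape (hidden_dim,) from the
    layer the concept vector was read from; `concept_vector` is that direction.
    Scores near +1 suggest the concept is present, near -1 its opposite; a
    decision threshold would be calibrated on held-out labelled prompts.
    """
    return F.cosine_similarity(hidden, concept_vector, dim=0).item()
```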
This presents a new approach for interpreting NNs at the level of internal representations, instead of studying outputs or mechanistically analysing the network. This top-down frame of analysis might offer a solution to problems such as detecting deceptive behaviour or identifying harmful representations, without the need to mechanistically understand the model in terms of low-level circuits [4, 24]. For example, RE has been used as a lie detector [6] and for detecting jailbreak attacks [11].
Furthermore, it offers a novel way to control the behaviour of LLMs. Whereas current approaches for aligning LLM behaviour act on the weights (fine-tuning) or the inputs (prompting), Representation Engineering directly intervenes on the activations during the forward pass, allowing for efficient and fine-grained control. This is broadly applicable, for example to reducing sycophancy [17] or aligning LLMs with human preferences [19].
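To illustrate what such an intervention can look like, here is a minimal steering sketch using a PyTorch forward hook on GPT-2 from Hugging Face transformers. The layer index, steering coefficient, and the random placeholder vector are my illustrative assumptions; a real concept vector would come from Representation Reading as described below, and the attribute path `model.transformer.h` is specific to GPT-2's layout.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # stand-in model; the layer layout below is GPT-2 specific
LAYER = 6            # which decoder block to intervene on (illustrative)
ALPHA = 4.0          # steering strength, a hyperparameter to tune

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Placeholder: a real concept vector would be extracted via Representation
# Reading (see the Method section); here we just use a random unit vector.
concept_vector = torch.randn(model.config.hidden_size)
concept_vector = concept_vector / concept_vector.norm()

def steering_hook(module, inputs, output):
    """Add the concept direction to the block's output hidden states."""
    # GPT-2 blocks return a tuple whose first element is the hidden states,
    # shape (batch, seq_len, hidden_dim); broadcasting adds the vector to
    # every token position.
    return (output[0] + ALPHA * concept_vector,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
out = model.generate(**tok("The weather today", return_tensors="pt"),
                     max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()  # restore the unmodified model
```

Subtracting the vector instead of adding it (a negative ALPHA) suppresses the concept rather than promoting it.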
This method operates at the level of representations, i.e. the vectors in the activations of an LLM that are associated with a concept, behaviour, or task. Golechha and Dao [24] as well as Zou et al. [4] argue that interpreting representations is a more effective paradigm for understanding and aligning LLMs than the circuit-level analysis popular in Mechanistic Interpretability (MI), since MI might not scale to large, complex systems, whereas RE allows the study of emergent structures in LLMs that can be distributed across the network.
Method
Methods for Representation Engineering have two important parts: Reading and Steering. Representation Reading derives a vector from the activations that captures how the model represents a human-aligned concept like honesty; Representation Steering then changes the activations with that vector to suppress or promote that concept in the outputs.
For Representation Reading one needs to design inputs, read out the activations, and derive the vector representing the concept of interest from those activations. Firstly, one devises inputs that contrast each other with respect to the concept. For example, one prompt might encourage the model to be honest while its counterpart encourages it to be dishonest; a minimal code sketch of this reading step follows below.
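As a rough illustration of this reading step, the sketch below (PyTorch + Hugging Face transformers) collects the last-token activation for each prompt in two contrastive sets and takes the difference of the set means as the concept vector. The model name, layer index, and prompts are illustrative assumptions; published methods vary, e.g. Zou et al. [4] also use PCA over activation differences rather than a plain difference of means.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # stand-in; any causal LM that exposes hidden states works
LAYER = 6            # illustrative choice of which decoder block to read from

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

# Contrastive prompts: identical except for the concept of interest (honesty).
positive_prompts = ["Pretend you are an honest person. Tell me about your day."]
negative_prompts = ["Pretend you are a dishonest person. Tell me about your day."]

def last_token_activations(prompts, layer):
    """Hidden state of the final token at `layer`, one row per prompt."""
    acts = []
    for p in prompts:
        inputs = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        # hidden_states[0] is the embedding output, so index layer + 1 is the
        # output of decoder block `layer`; shape (batch, seq_len, hidden_dim).
        acts.append(out.hidden_states[layer + 1][0, -1, :])
    return torch.stack(acts)

# The difference of the mean activations of the two sets is the concept vector.
concept_vector = (last_token_activations(positive_prompts, LAYER).mean(dim=0)
                  - last_token_activations(negative_prompts, LAYER).mean(dim=0))
concept_vector = concept_vector / concept_vector.norm()
```

Normalising the vector separates the direction (what the concept is) from the scale of any later intervention (how strongly to steer).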