Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: New Tool: the Residual Stream Viewer, published by Adam Yedidia on October 1, 2023 on The AI Alignment Forum.
This is a link-post for the residual stream viewer, which can be found here. It's an online tool whose goal is to make it easier to do interpretability research by letting you easily look at directions within the residual stream. It's still in quite an early, unpolished state, so there may be bugs, and any feature requests are very welcome! I'll probably do more to flesh this out if I get the sense that people are finding it useful.
Very briefly, the tool lets you see the dot product of each token's residual stream vector with a particular direction. The default directions that you can look at using the tool were found by PCA, and I think many of them are fairly interpretable even at a glance (though it's worth noting that even if they correlate heavily with an apparent feature, that's no guarantee the network is actually using those directions).
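For the curious, here's a minimal sketch of that core computation, assuming the transformer_lens and scikit-learn libraries; the layer, prompts, and PCA setup below are illustrative assumptions, not necessarily how the tool derives its default directions:

```python
# A sketch (not the tool's actual code): fit PCA on residual stream
# activations, then take the dot product of each token's residual
# stream vector with a principal direction.
import torch
from sklearn.decomposition import PCA
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT2-small
LAYER = 6  # hypothetical layer choice

# Collect residual stream vectors over a few prompts to fit PCA on.
# (In practice you'd want many more prompts than this.)
prompts = ["The quick brown fox", "Paris is the capital of France"]
vectors = []
for prompt in prompts:
    _, cache = model.run_with_cache(prompt)
    # resid_post has shape [batch, seq, d_model]; d_model = 768 for GPT2-small
    vectors.append(cache["resid_post", LAYER][0])
activations = torch.cat(vectors).detach().cpu().numpy()

pca = PCA(n_components=5)
pca.fit(activations)
direction = pca.components_[0]  # first principal direction, shape [768]

# Dot product of each token's residual stream vector with the direction.
prompt = "The cat sat on the mat"
_, cache = model.run_with_cache(prompt)
resid = cache["resid_post", LAYER][0].detach().cpu().numpy()  # [seq, 768]
scores = resid @ direction
for token, score in zip(model.to_str_tokens(prompt), scores):
    print(f"{token!r}: {score:+.2f}")
```

This prints one score per token; a token with a large positive or negative score is one whose residual stream points strongly along (or against) the chosen direction, which is the quantity the tool displays at each token.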
[Screenshot of the current version of the tool]
There's a YouTube tutorial for the tool available here. I endorse the YouTube tutorial as probably a better way to get acquainted with the tool than the usage guide, but I'll copy-paste the usage guide for the remainder of the post.
The residual stream viewer is a tool for finding interesting directions in the residual stream of GPT2-small, for writing explanations for those directions and reading the explanations left by others, and for constructing new directions out of linear combinations of old ones.
A more detailed explanation of how transformer networks work and what the residual stream is can be found here. If you want to actually understand what the residual stream is and how transformers work, the text that follows here is hopelessly insufficient, and you should really follow the earlier link. However, as a very brief summary of what the "residual stream" is:
The residual stream can be thought of as the intermediate state of the transformer network's computation: it is the output of each layer of the network before it is fed into the next layer. Each prompt is split into "tokens," i.e. subparts of the prompt that roughly correspond to words or parts of words, and at each layer, each token has its own associated residual stream vector.

The residual stream at the beginning of the network, before any layer has acted, is equal to the "token embedding" (the "meaning" of that token, encoded as a 768-dimensional vector) plus the "positional embedding" (the "meaning" of that token's position in the prompt, encoded as a 768-dimensional vector). Each layer acts on the residual stream by reading certain parts of it, doing some computation on them, and then adding the result back into the residual stream. At the end of the network, the residual stream is transformed into a probability distribution over which token comes next.
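As a quick sanity check on that summary, here's a minimal sketch, assuming the transformer_lens library, verifying that the residual stream before the first layer equals the token embedding plus the positional embedding, and that each layer adds its outputs back into the stream:

```python
# Verify the two claims above on GPT2-small, using transformer_lens.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
_, cache = model.run_with_cache("Hello world")

# Claim 1: before any layer acts, the residual stream is the token
# embedding plus the positional embedding.
embed = cache["hook_embed"]          # token embeddings, [batch, seq, 768]
pos_embed = cache["hook_pos_embed"]  # positional embeddings
resid_pre = cache["blocks.0.hook_resid_pre"]  # residual stream before layer 0
print(torch.allclose(resid_pre, embed + pos_embed))  # True

# Claim 2: each layer adds its computation's result back in. For layer 0,
# the residual stream after the layer is the stream before it plus the
# attention output plus the MLP output.
resid_post = cache["blocks.0.hook_resid_post"]
attn_out = cache["blocks.0.hook_attn_out"]
mlp_out = cache["blocks.0.hook_mlp_out"]
print(torch.allclose(resid_post, resid_pre + attn_out + mlp_out, atol=1e-4))
```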
It's not easy to directly interpret a 768-dimensional vector, let alone one at each layer and at each token in the prompt. The purpose of this tool is to make the job of interpreting such vectors easier.

One way of interpreting the residual stream is by considering different possible directions in the residual stream. By analogy, imagine an arrow in front of you, oriented somehow in space; the arrow represents the residual stream. One way you might approach describing the arrow's direction is by considering how "northerly" it is; that is, to what degree the arrow is pointing north. If the arrow were pointing northward, we might say that it had positive northerliness, and if it were pointing southward, we might say that it had negative northerliness. An arrow pointing northeast could still be said to have positive northerliness; it wouldn't have to be pointing exactly north.
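To make the analogy concrete: "northerliness" is just a dot product with a unit direction. Here's the same arithmetic in a toy two-dimensional sketch (the vectors are made up for illustration):

```python
# The "northerliness" analogy as a dot product in two dimensions.
import numpy as np

north = np.array([0.0, 1.0])                   # the direction of interest
northeast = np.array([1.0, 1.0]) / np.sqrt(2)  # unit arrow pointing northeast
south = np.array([0.0, -1.0])                  # unit arrow pointing south

print(northeast @ north)  # ~ +0.71: positive northerliness, though not due north
print(south @ north)      # -1.0: negative northerliness
```

In the tool, the "arrow" is a 768-dimensional residual stream vector and the "north" is a direction such as one found by PCA, but the computation is the same dot product.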