Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: SolidGoldMagikarp (plus, prompt generation), published by Jessica Rumbelow on February 5, 2023 on LessWrong.
Work done at SERI-MATS, over the past two months, with Matthew Watkins.
TL;DR:
Anomalous Tokens: a mysterious failure mode for GPT (which reliably insulted my colleague Matthew)
We have found a set of anomalous tokens which result in a previously undocumented failure mode for GPT-2 and GPT-3 models. (The instruct models are particularly deranged.)
It also appears to break determinism in the playground at temperature 0, which shouldn't happen.
Prompt Generation: a new interpretability method for language models (which reliably finds prompts that result in a target completion).
Good for eliciting knowledge
Generating adversarial inputs
Automating prompt search (e.g. for fine-tuning)
In this post, we'll introduce the prototype of a new model-agnostic interpretability method for language models which reliably generates adversarial prompts that result in a target completion. We'll also demonstrate a previously undocumented failure mode for GPT-2 and GPT-3 language models, which results in bizarre completions (in some cases explicitly counter to the purpose of the model), and present the results of our investigation into this phenomenon.
First up, prompt generation. An easy intuition for this is to think about feature visualisation for image classifiers (there's an excellent explanation here, if you're unfamiliar with the concept).
We can study how a neural network represents concepts by taking some random input and using gradient descent to tweak it until it maximises a particular activation. The image above shows the resulting inputs that maximise the output logits for the classes 'goldfish', 'monarch', 'tarantula' and 'flamingo'. This is pretty cool! We can see what VGG thinks is the most 'goldfish'-y thing in the world, and it's got scales and fins. Note though, that it isn't a picture of a single goldfish. We're not seeing the kind of input that VGG was trained on. We're seeing what VGG has learned. This is handy: if you wanted to sanity check your goldfish detector, and the feature visualisation showed just water, you'd know that the model hadn't actually learned to detect goldfish, but rather the environments in which they typically appear. So it would label every image containing water as 'goldfish', which is probably not what you want. Time to go get some more training data.
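To make that concrete, here's a minimal sketch of the optimisation loop in PyTorch, using a pretrained torchvision VGG. The class index, step count and penalty weight are illustrative choices rather than anything from the post, and real feature visualisation work typically adds image transformations and heavier regularisation to get the clean pictures shown above.

```python
# Minimal feature-visualisation sketch (illustrative hyperparameters).
import torch
import torchvision.models as models

model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()
for p in model.parameters():
    p.requires_grad_(False)  # only the input image gets optimised

target_class = 1  # ImageNet index for 'goldfish'
image = torch.randn(1, 3, 224, 224, requires_grad=True)  # start from noise
optimizer = torch.optim.Adam([image], lr=0.05)

for step in range(256):
    optimizer.zero_grad()
    logits = model(image)
    # Maximise the target logit (minimise its negative), with a small
    # L2 penalty so the input doesn't blow up.
    loss = -logits[0, target_class] + 1e-4 * image.norm()
    loss.backward()
    optimizer.step()
```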
So, how can we apply this approach to language models?
Some interesting stuff here. Note that as with image models, we're not optimising for realistic inputs, but rather for inputs that maximise the output probability of the target completion, shown in bold above.
So now we can do stuff like this:
And this:
I'll leave it to you to lament the state of the internet that results in the above optimised inputs for the token ' girl'.
How do we do this? It's tricky, because unlike pixels, the inputs to LLMs are discrete tokens. This is not conducive to gradient descent. However, these discrete tokens are mapped to embeddings, which do occupy a continuous, albeit sparse, space. (Most of this space doesn't correspond to actual tokens – there is a lot of space between tokens in embedding space, and we don't want to find a solution there.) However, with a combination of regularisation and explicit coercion to keep embeddings close to the realm of legal tokens during optimisation, we can make it work. Code available here if you want more detail.
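To give a flavour of what this looks like in practice, here's a rough sketch using PyTorch and Hugging Face transformers. It is not the authors' implementation (that's in the linked code): the prompt length, learning rate and penalty weight are illustrative, and a simple distance-to-nearest-token penalty stands in for the combination of regularisation and coercion described above.

```python
# Sketch: optimise "soft" prompt embeddings so GPT-2 assigns high probability
# to a target completion, while staying close to legal token embeddings.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
for p in model.parameters():
    p.requires_grad_(False)

embed_matrix = model.transformer.wte.weight              # (vocab_size, d_model)
target_ids = tokenizer.encode(" goldfish", return_tensors="pt")
target_embeds = model.transformer.wte(target_ids)

prompt_len = 5
# Initialise the soft prompt from random token embeddings.
init_ids = torch.randint(0, embed_matrix.shape[0], (1, prompt_len))
soft_prompt = embed_matrix[init_ids].clone().detach().requires_grad_(True)
optimizer = torch.optim.Adam([soft_prompt], lr=0.1)

for step in range(200):
    optimizer.zero_grad()
    inputs_embeds = torch.cat([soft_prompt, target_embeds], dim=1)
    logits = model(inputs_embeds=inputs_embeds).logits
    # Cross-entropy of the target tokens, predicted from the positions before them.
    pred = logits[:, prompt_len - 1:-1, :]
    ce = torch.nn.functional.cross_entropy(
        pred.reshape(-1, pred.size(-1)), target_ids.reshape(-1))
    # Penalise distance from the nearest legal token embedding.
    dists = torch.cdist(soft_prompt[0], embed_matrix)     # (prompt_len, vocab)
    reg = dists.min(dim=-1).values.mean()
    (ce + 0.1 * reg).backward()
    optimizer.step()

# Snap each optimised embedding to its nearest token to read off a discrete prompt.
prompt_ids = torch.cdist(soft_prompt[0].detach(), embed_matrix).argmin(dim=-1)
print(tokenizer.decode(prompt_ids))
```

The final snapping step is what turns the optimised points in embedding space back into a discrete prompt that can actually be fed to the model, which is why the penalty keeping embeddings near legal tokens matters.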
Prompt generation is only possible because token embedding space is semantically meaningful: related tokens are close together. We discovered this by running k-means over the embedding space of the GPT vocabulary, and found many clusters that are surprisingly robust to random initialisation of the centroids. Here are a few examples.
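For illustration, something like the following reproduces the clustering step with scikit-learn; the number of clusters and the tokens printed per cluster are arbitrary choices here, not necessarily the post's exact setup.

```python
# Sketch: k-means over the GPT-2 token embedding matrix, then inspect one cluster.
import torch
from sklearn.cluster import KMeans
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

embeddings = model.transformer.wte.weight.detach().numpy()   # (50257, 768)
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(embeddings)

# Show the tokens closest to one cluster's centroid.
cluster = 0
members = (kmeans.labels_ == cluster).nonzero()[0]
centroid = kmeans.cluster_centers_[cluster]
dists = ((embeddings[members] - centroid) ** 2).sum(axis=1)
closest = members[dists.argsort()[:10]]
print([tokenizer.decode([int(i)]) for i in closest])
```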
During this process we found some weir...