The Nonlinear Library

LW - PIZZA: An Open Source Library for Closed LLM Attribution (or "why did ChatGPT say that?") by Jessica Rumbelow


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: PIZZA: An Open Source Library for Closed LLM Attribution (or "why did ChatGPT say that?"), published by Jessica Rumbelow on August 4, 2024 on LessWrong.
From the research & engineering team at Leap Laboratories (incl. @Arush, @sebastian-sosa, @Robbie McCorkell), where we use AI interpretability to accelerate scientific discovery from data.
This post is about our LLM attribution repo PIZZA: Prompt Input Z? Zonal Attribution. (In the grand scientific tradition we have tortured our acronym nearly to death. For the crimes of others see [1].)
All examples in this post can be found in this notebook, which is also probably the easiest way to start experimenting with PIZZA.
What is attribution?
One question we might ask when interacting with machine learning models is something like: "why did this input cause that particular output?".
If we're working with a language model like ChatGPT, we could actually just ask this in natural language: "Why did you respond that way?" or similar - but there's no guarantee that the model's natural language explanation actually reflects the underlying cause of the original completion. The model's response is conditioned on your question, and might well be different to the true cause.
Enter attribution!
Attribution in machine learning is used to explain the contribution of individual features or inputs to the final prediction made by a model. The goal is to understand which parts of the input data are most influential in determining the model's output.
The result typically looks like a heatmap (sometimes called a 'saliency map') over the model inputs, for each output. It's most commonly used in computer vision - but of course these days, you're not big if you're not big in LLM-land.
So, the team at Leap present you with PIZZA: an open source library that makes it easy to calculate attribution for all LLMs, even closed-source ones like ChatGPT.
An Example
GPT3.5 not so hot with the theory of mind there. Can we find out what went wrong?
That's not very helpful! We want to know why the mistake was made in the first place. Here's the attribution:
Mary (0.32) puts (0.25) an (0.15) apple (0.36) in (0.18) the (0.18) box (0.08) . (0.08)
The (0.08) box (0.09) is (0.09) labelled (0.09) ' (0.09) pen (0.09) cil (0.09) s (0.09) '. (0.09)
John (0.09) enters (0.03) the (0.03) room (0.03) . (0.03)
What (0.03) does (0.03) he (0.03) think (0.03) is (0.03) in (0.30) the (0.13) box (0.15) ? (0.13)
Answer (0.14) in (0.26) 1 (0.27) word (0.31) . (0.16)
It looks like the request to "Answer in 1 word" is pretty important - in fact, it's attributed more highly than the actual contents of the box. Let's try changing it.
That's better.
How it works
We iteratively perturb the input, and track how each perturbation changes the output.
More technical detail, and all the code, is available in the repo. In brief, PIZZA saliency maps rely on two methods: a perturbation method, which determines how the input is iteratively changed; and an attribution method, which determines how we measure the resulting change in output in response to each perturbation. We implement a couple of different types of each method.
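To make the perturb-and-measure loop concrete, here is a minimal sketch of the idea. Note that `toy_model` and `attribute` are hypothetical stand-ins written for this post, not PIZZA's actual API - a real run would query an LLM instead of the keyword-weighted scorer used here.

```python
def toy_model(tokens):
    """Stand-in for an LLM: returns a made-up 'probability' of a
    completion, driven by a few hypothetical keyword weights."""
    weights = {"pencils": 0.5, "labelled": 0.2, "box": 0.1}
    return sum(weights.get(t, 0.01) for t in tokens)

def attribute(tokens, model):
    """Perturbation-based attribution: remove each token in turn and
    record how much the model's output drops relative to baseline."""
    baseline = model(tokens)
    scores = []
    for i in range(len(tokens)):
        perturbed = tokens[:i] + tokens[i + 1:]      # perturbation: drop token i
        scores.append(baseline - model(perturbed))   # attribution: output change
    return scores
```

Tokens whose removal changes the output most get the highest scores - that is the heatmap shown above.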
Perturbation
Replace each token, or group of tokens, with either a user-specified replacement token or with nothing (i.e. remove it).
Or, replace each token with its nth nearest token.
We do this either iteratively for each token or word in the prompt, or using hierarchical perturbation.
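The hierarchical variant can be sketched roughly as follows: perturb large token groups first, and only subdivide groups whose removal meaningfully changes the output. The threshold, the recursion scheme, and `toy_model` here are all illustrative assumptions, not PIZZA's implementation.

```python
def toy_model(tokens):
    """Hypothetical scorer standing in for an LLM call."""
    return sum({"pencils": 0.5}.get(t, 0.01) for t in tokens)

def hierarchical_attribution(tokens, model, lo=0, hi=None, threshold=0.05):
    """Recursively attribute: score the group tokens[lo:hi] by removing
    it wholesale; subdivide only if the change exceeds the threshold."""
    if hi is None:
        hi = len(tokens)
    baseline = model(tokens)
    perturbed = tokens[:lo] + tokens[hi:]        # remove the whole group
    delta = baseline - model(perturbed)
    if hi - lo == 1 or abs(delta) < threshold:
        # Single token, or an unimportant group: spread score evenly.
        return {i: delta / (hi - lo) for i in range(lo, hi)}
    mid = (lo + hi) // 2                         # important group: subdivide
    scores = hierarchical_attribution(tokens, model, lo, mid, threshold)
    scores.update(hierarchical_attribution(tokens, model, mid, hi, threshold))
    return scores
```

The payoff is fewer model calls: unimportant stretches of the prompt are scored in one query rather than one per token.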
Attribution
Look at the change in the probability of the completion.
Look at the change in the meaning of the completion (using embeddings).
We calculate this for each output token in the completion - so you can see not only how each input token influenced the output overall, but also how each input token affected each output token individually.
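The second (meaning-based) metric can be sketched like this: embed the original and perturbed completions and score the perturbation by how far the completion moved in embedding space. The tiny hand-made vectors below are hypothetical stand-ins for real embeddings (as noted later, PIZZA uses GPT-2's as a proxy for closed models).

```python
import math

# Hypothetical 2-d "embeddings" for a few completions (illustration only).
EMB = {
    "pencils": [1.0, 0.1],
    "apple":   [0.1, 1.0],
    "fruit":   [0.2, 0.9],
}

def cosine_distance(u, v):
    """1 - cosine similarity: 0 for identical directions, larger for
    completions that point in different semantic directions."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def semantic_attribution(original_completion, perturbed_completion):
    """Attribution score = how far the completion's meaning moved."""
    return cosine_distance(EMB[original_completion], EMB[perturbed_completion])
```

A perturbation that flips the answer from "apple" to "pencils" scores much higher than one that merely rephrases it to "fruit", which is the point of measuring meaning rather than raw probability.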
Caveat
Since we don't have access to closed-source tokenisers or embeddings, we use a proxy - in this case, GPT2's. Thi...
The Nonlinear Library, by The Nonlinear Fund
