The Nonlinear Library

LW - PIZZA: An Open Source Library for Closed LLM Attribution (or "why did ChatGPT say that?") by Jessica Rumbelow


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: PIZZA: An Open Source Library for Closed LLM Attribution (or "why did ChatGPT say that?"), published by Jessica Rumbelow on August 4, 2024 on LessWrong.
From the research & engineering team at Leap Laboratories (incl. @Arush, @sebastian-sosa, @Robbie McCorkell), where we use AI interpretability to accelerate scientific discovery from data.
This post is about our LLM attribution repo PIZZA: Prompt Input Z? Zonal Attribution. (In the grand scientific tradition we have tortured our acronym nearly to death. For the crimes of others see [1].)
All examples in this post can be found in this notebook, which is also probably the easiest way to start experimenting with PIZZA.
What is attribution?
One question we might ask when interacting with machine learning models is something like: "why did this input cause that particular output?".
If we're working with a language model like ChatGPT, we could actually just ask this in natural language: "Why did you respond that way?" or similar - but there's no guarantee that the model's natural language explanation actually reflects the underlying cause of the original completion. The model's response is conditioned on your question, and might well be different to the true cause.
Enter attribution!
Attribution in machine learning is used to explain the contribution of individual features or inputs to the final prediction made by a model. The goal is to understand which parts of the input data are most influential in determining the model's output.
The result typically looks like a heatmap (sometimes called a 'saliency map') over the model inputs, for each output. It's most commonly used in computer vision - but of course these days, you're not big if you're not big in LLM-land.
So, the team at Leap present you with PIZZA: an open source library that makes it easy to calculate attribution for all LLMs, even closed-source ones like ChatGPT.
An Example
GPT3.5 not so hot with the theory of mind there. Can we find out what went wrong?
That's not very helpful! We want to know why the mistake was made in the first place. Here's the attribution:
Mary (0.32) puts (0.25) an (0.15) apple (0.36) in (0.18) the (0.18) box (0.08) . (0.08)
The (0.08) box (0.09) is (0.09) labelled (0.09) ' (0.09) pen (0.09) cil (0.09) s (0.09) '. (0.09)
John (0.09) enters (0.03) the (0.03) room (0.03) . (0.03)
What (0.03) does (0.03) he (0.03) think (0.03) is (0.03) in (0.30) the (0.13) box (0.15) ? (0.13)
Answer (0.14) in (0.26) 1 (0.27) word (0.31) . (0.16)
It looks like the request to "Answer in 1 word" is pretty important - in fact, it's attributed more highly than the actual contents of the box. Let's try changing it.
That's better.
How it works
We iteratively perturb the input, and track how each perturbation changes the output.
More technical detail, and all the code, is available in the repo. In brief, PIZZA saliency maps rely on two methods: a perturbation method, which determines how the input is iteratively changed; and an attribution method, which determines how we measure the resulting change in output in response to each perturbation. We implement a couple of different types of each method.
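To make the perturb-and-measure loop concrete, here is a minimal sketch of the idea. Note that `toy_model` and `attribute` are hypothetical stand-ins written for this post, not PIZZA's actual API - a real run would query an LLM instead of the keyword-weighted scorer used here.

```python
def toy_model(tokens):
    """Stand-in for an LLM: returns a made-up 'probability' of a
    completion, driven by a few hypothetical keyword weights."""
    weights = {"pencils": 0.5, "labelled": 0.2, "box": 0.1}
    return sum(weights.get(t, 0.01) for t in tokens)

def attribute(tokens, model):
    """Perturbation-based attribution: remove each token in turn and
    record how much the model's output drops relative to baseline."""
    baseline = model(tokens)
    scores = []
    for i in range(len(tokens)):
        perturbed = tokens[:i] + tokens[i + 1:]      # perturbation: drop token i
        scores.append(baseline - model(perturbed))   # attribution: output change
    return scores
```

Tokens whose removal changes the output most get the highest scores - that is the heatmap shown above.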
Perturbation
Replace each token, or group of tokens, with either a user-specified replacement token or with nothing (i.e. remove it).
Or, replace each token with its nth nearest token.
We do this either iteratively for each token or word in the prompt, or using hierarchical perturbation.
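The hierarchical variant can be sketched roughly as follows: perturb large token groups first, and only subdivide groups whose removal meaningfully changes the output. The threshold, the recursion scheme, and `toy_model` here are all illustrative assumptions, not PIZZA's implementation.

```python
def toy_model(tokens):
    """Hypothetical scorer standing in for an LLM call."""
    return sum({"pencils": 0.5}.get(t, 0.01) for t in tokens)

def hierarchical_attribution(tokens, model, lo=0, hi=None, threshold=0.05):
    """Recursively attribute: score the group tokens[lo:hi] by removing
    it wholesale; subdivide only if the change exceeds the threshold."""
    if hi is None:
        hi = len(tokens)
    baseline = model(tokens)
    perturbed = tokens[:lo] + tokens[hi:]        # remove the whole group
    delta = baseline - model(perturbed)
    if hi - lo == 1 or abs(delta) < threshold:
        # Single token, or an unimportant group: spread score evenly.
        return {i: delta / (hi - lo) for i in range(lo, hi)}
    mid = (lo + hi) // 2                         # important group: subdivide
    scores = hierarchical_attribution(tokens, model, lo, mid, threshold)
    scores.update(hierarchical_attribution(tokens, model, mid, hi, threshold))
    return scores
```

The payoff is fewer model calls: unimportant stretches of the prompt are scored in one query rather than one per token.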
Attribution
Look at the change in the probability of the completion.
Look at the change in the meaning of the completion (using embeddings).
We calculate this for each output token in the completion - so you can see not only how each input token influenced the output overall, but also how each input token affected each output token individually.
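The second (meaning-based) metric can be sketched like this: embed the original and perturbed completions and score the perturbation by how far the completion moved in embedding space. The tiny hand-made vectors below are hypothetical stand-ins for real embeddings (as noted later, PIZZA uses GPT-2's as a proxy for closed models).

```python
import math

# Hypothetical 2-d "embeddings" for a few completions (illustration only).
EMB = {
    "pencils": [1.0, 0.1],
    "apple":   [0.1, 1.0],
    "fruit":   [0.2, 0.9],
}

def cosine_distance(u, v):
    """1 - cosine similarity: 0 for identical directions, larger for
    completions that point in different semantic directions."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def semantic_attribution(original_completion, perturbed_completion):
    """Attribution score = how far the completion's meaning moved."""
    return cosine_distance(EMB[original_completion], EMB[perturbed_completion])
```

A perturbation that flips the answer from "apple" to "pencils" scores much higher than one that merely rephrases it to "fruit", which is the point of measuring meaning rather than raw probability.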
Caveat
Since we don't have access to closed-source tokenisers or embeddings, we use a proxy - in this case, GPT2's. Thi...
The Nonlinear Library, by The Nonlinear Fund
