Shapley Value Attribution in Chain of Thought, published by leogao on April 14, 2023 on The AI Alignment Forum.
TL;DR: Language models sometimes seem to ignore parts of the chain of thought, and larger models appear to do this more often. Shapley value attribution is a possible approach to get a more detailed picture of the information flow within the chain of thought, though it has its limitations.
Project status: The analysis is not as rigorous as I would prefer, but I'm going to be working on other directions for the foreseeable future, so I'm posting what I already have in case it's useful to others.
Thanks to Jacob Hilton, Giambattista Parascandolo, Tamera Lanham, Ethan Perez, and Jason Wei for discussion.
Motivation
Chain of thought (CoT) has been proposed as a method for language model interpretability (see Externalized Reasoning Oversight, Visible Thoughts). One crucial requirement for interpretability methods is that they should accurately reflect the cognition inside the model. However, by default there is nothing forcing the CoT to actually correspond to the model’s cognition, and there may exist theoretical limitations to doing so in general.
Because it is plausible that the first AGI systems will resemble current LMs equipped with more sophisticated CoT and CoT-like techniques, it is valuable to study the properties of CoT, and to understand and address its limitations.
Related work
Shapley values have been used very broadly in ML for feature importance and attribution (Cohen et al., 2007; Štrumbelj and Kononenko, 2014; Owen and Prieur, 2016; Lundberg and Lee, 2017; Sundararajan and Najmi, 2020). Jain and Wallace (2019) argue that attention maps can be misleading as attribution, motivating better attribution for information flow in LMs. Kumar et al. (2020) highlight areas where Shapley-value-based attribution falls short for certain interpretability use cases.
Madaan and Yazdanbakhsh (2022) consider a similar method of selectively ablating tokens to deduce what information the model depends on. Wang et al. (2022) find that prompting with incorrect CoT has a surprisingly minor impact on performance.
Effect of Interventions
We use a method similar to Kojima et al. (2022) on GSM8K (Cobbe et al., 2021) with GPT-4: we first generate a chain of thought and evaluate the answer. Then, for every chain of thought that results in a correct answer, we intervene as follows: we choose a random numerical value in the CoT, replace it with a random number within +/-3 of the original, discard the remainder of the CoT, and regenerate it. If the LM is strictly following the CoT it writes out, this intervention should almost always result in an incorrect answer, just as a person who made a mistake in one calculation and propagated the error through would reach a wrong final answer (with occasional rare cases where the new value happens to also yield the correct answer, though from qualitative inspection this is very rarely the case).
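For concreteness, here is a minimal sketch of the intervention step; the integer-only regular expression and the exclusion of the original value are my own simplifying assumptions about details the post leaves unspecified:

```python
import random
import re

def intervene_on_cot(cot: str) -> str:
    """Corrupt one random numerical value in a chain of thought.

    A minimal sketch of the intervention: pick a random number in the CoT,
    replace it with a random integer within +/-3 of the original, and drop
    everything after it so the model must regenerate the rest of the CoT.
    """
    matches = list(re.finditer(r"\d+", cot))  # integer-only matching (an assumption)
    if not matches:
        return cot  # nothing to intervene on
    target = random.choice(matches)
    original = int(target.group())
    # Sample from [original - 3, original + 3], excluding the original value.
    candidates = [v for v in range(original - 3, original + 4) if v != original]
    replacement = random.choice(candidates)
    # Keep the prefix up to and including the replaced value; the LM
    # regenerates the remainder of the chain of thought from here.
    return cot[:target.start()] + str(replacement)
```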
Some cherry-picked examples (red = intervention, blue = correct continuations that are seemingly non sequiturs):
We test how frequently the error fails to propagate in several different settings (n=100):
Setting               Accuracy (w/ CoT)   P(error not propagated | original correct)
GPT-4, zero-shot      0.88                0.68
GPT-4 base, 2-shot    0.73                0.63
GPT-3.5, zero-shot    0.43                0.33
Interestingly, if we condition on the CoT answer being correct and the single-forward-pass answer being incorrect (i.e., the LM could only solve the problem with the CoT), the post-intervention accuracy for GPT-4 is still 0.65.
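Put together, the evaluation reduces to a loop like the following sketch. Here `generate_cot_answer`, `regenerate_after_intervention`, and `answer_correct` are hypothetical wrappers around the model API and the GSM8K answer check, not functions from the post, and `intervene_on_cot` is the sketch above:

```python
def fraction_error_not_propagated(problems, n=100):
    """Sketch of the evaluation: estimate P(error not propagated | original correct).

    generate_cot_answer, regenerate_after_intervention, and answer_correct
    are hypothetical placeholders for the model and dataset plumbing.
    """
    n_correct, n_not_propagated = 0, 0
    for problem in problems[:n]:
        cot, answer = generate_cot_answer(problem)
        if not answer_correct(problem, answer):
            continue  # condition on the original CoT answer being correct
        n_correct += 1
        corrupted_prefix = intervene_on_cot(cot)
        new_answer = regenerate_after_intervention(problem, corrupted_prefix)
        # A still-correct answer means the injected error did not propagate.
        if answer_correct(problem, new_answer):
            n_not_propagated += 1
    return n_not_propagated / max(n_correct, 1)
```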
Shapley value attribution
We would like to get more granular information about the causal structure (i.e., which tokens cause which other tokens). One thing we could do is look at how an intervention at each token affects the logprob of each other token. However, one major problem with this approach is that it only captures the marginal effect of each intervention in isolation: when the same information is carried redundantly by multiple tokens, ablating any single one may show little effect even though that information matters. Shapley values account for such interactions by averaging a token's marginal contribution over subsets of the other interventions.
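To make this concrete, here is a minimal sketch of the standard permutation-sampling estimator for Shapley values. The names `players` and `value_fn` are my placeholders: `players` would be a sequence of intervention sites (e.g. CoT token positions), and `value_fn(subset)` would return the quantity being attributed (e.g. the logprob of a downstream token) with interventions applied at every site in `subset`:

```python
import random

def shapley_attribution(players, value_fn, n_samples=200):
    """Monte Carlo Shapley value estimation via random permutations.

    For each sampled ordering of players, add players one at a time and
    credit each with the marginal change in value_fn when it is added.
    Averaging over many orderings converges to the exact Shapley values.
    """
    phi = {p: 0.0 for p in players}
    for _ in range(n_samples):
        order = list(players)
        random.shuffle(order)  # a uniformly random ordering of the players
        included = set()
        prev = value_fn(frozenset(included))  # value with no interventions applied
        for p in order:
            included.add(p)
            cur = value_fn(frozenset(included))
            phi[p] += cur - prev  # marginal contribution of p under this ordering
            prev = cur
    return {p: total / n_samples for p, total in phi.items()}
```

Each sample costs len(players) + 1 evaluations of value_fn, and in this setting each evaluation would mean rescoring the CoT with the chosen interventions applied, which is what makes Shapley attribution expensive in practice.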