June 21, 2024

“Attention Output SAEs Improve Circuit Analysis” by Connor Kissane, robertzk, Arthur Conmy, Neel Nanda

39 minutes

This is the final post of our Alignment Forum sequence produced as part of the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort.

Executive Summary

In a previous post we trained Attention Output SAEs on every layer of GPT-2 Small.
Following that work, we wanted to stress-test whether Attention SAEs are genuinely helpful for circuit analysis research. Success would both validate SAEs as a useful tool and provide evidence that they identify the real variables of the model's computation.
We believe that we now have evidence that attention SAEs can:
- Help make novel mechanistic interpretability discoveries that prior methods could not make.
- Allow for tracing information through the model's forward passes on arbitrary prompts.
In this post we discuss the three outputs from this circuit analysis work:

We use SAEs to deepen our understanding of the IOI circuit. It was previously thought that the indirect [...]

---

Outline:

(00:18) Executive Summary

(02:33) Introduction

(04:55) Evaluating our SAEs for IOI Circuit Analysis

(09:27) Case Study: Applying SAEs for Deeper Understanding of IOI

(11:29) Identifying the Positional Features with SAEs

(13:01) Interpreting the “Positional” Features

(14:35) Confirming the Hypothesis

(16:36) Applying SAEs to QK circuits: S-Inhibition Heads Sometimes Do IO-Boosting

(21:47) Discovering Attention Feature Circuits with Recursive DFA

(22:59) Understanding Recursive DFA

(27:24) Example: Attention feature 3.15566 decomposes into gender and name information

(29:54) Example: Retrieving Dave using recursive attention attribution.

(32:18) Even Cooler Examples: Win a $1000 Bounty!

(33:02) Related Work

(35:09) Limitations

(35:52) Future Work

(38:12) Citing this work

(38:25) Author Contributions Statement

The original text contained 6 footnotes which were omitted from this narration.

---

First published:

June 21st, 2024

Source:

https://www.lesswrong.com/posts/EGvtgB7ctifzxZg6v/attention-output-saes-improve-circuit-analysis

---

Narrated by TYPE III AUDIO.


By LessWrong
