Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Mech Interp Challenge: August - Deciphering the First Unique Character Model, published by TheMcDouglas on August 9, 2023 on The AI Alignment Forum.
I'm writing this post to advertise the second in the sequence of monthly mechanistic interpretability challenges (the first to be posted on LessWrong). They are designed in the spirit of Stephen Casper's challenges, but with the more specific aim of working well in the context of the rest of the ARENA material, and helping people put into practice all the things they've learned so far.
In this post, I'll describe the algorithmic problem & model trained to solve it, as well as the logistics for this challenge and the motivation for creating it. However, the main place you should access this problem is on the Streamlit page, where you can find setup code & instructions (as well as a link to a Colab notebook with all setup code included, if you'd prefer to use this).
Task
The algorithmic task is as follows: the model is presented with a sequence of characters, and for each character it has to correctly identify the first character in the sequence (up to and including the current character) which is unique up to that point.
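To make the task concrete, here is a minimal Python sketch of how the target labels could be computed for a sequence. The function name `first_unique_labels` and the `no_unique` placeholder (returned when every character seen so far has been repeated) are assumptions made for illustration. The sketch also leaves out the special "?" token at the zeroth position described under Model. The actual labelling code lives in the challenge's setup material and may differ.

```python
from collections import Counter


def first_unique_labels(seq: str, no_unique: str = "?") -> list[str]:
    """For each position i, return the first character of seq[:i+1]
    that occurs exactly once in seq[:i+1].

    `no_unique` is a hypothetical placeholder label for prefixes in
    which every character has already been repeated.
    """
    counts = Counter()
    labels = []
    for i, ch in enumerate(seq):
        counts[ch] += 1
        # Scan the prefix in order and pick the first character seen once.
        label = next((c for c in seq[: i + 1] if counts[c] == 1), no_unique)
        labels.append(label)
    return labels


print(first_unique_labels("abacb"))  # ['a', 'a', 'b', 'b', 'c']
```

In "abacb", the label stays "a" until the second "a" appears, then moves to "b", and finally to "c" once "b" is repeated as well.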
Model
Our model was trained by minimising cross-entropy loss between its predictions and the true labels, at every sequence position simultaneously (including the zeroth sequence position, which is trivial because the input and target are both always "?"). You can inspect the notebook training_model.ipynb in the GitHub repo to see how it was trained. I used the version of the model which achieved the highest accuracy over 50 epochs (accuracy ~99%).
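As a rough illustration of what "cross-entropy at every sequence position" means, the sketch below computes the loss for a single sequence in plain Python. It is not the training code from training_model.ipynb: the function name `sequence_cross_entropy` is hypothetical, and averaging over positions (rather than summing) is an assumption.

```python
import math


def sequence_cross_entropy(logits, targets):
    """Mean cross-entropy over all positions of one sequence.

    logits:  list of per-position lists of vocabulary scores
    targets: list of target vocabulary indices, one per position
    """
    total = 0.0
    for pos_logits, target in zip(logits, targets):
        # Numerically stable log-sum-exp for the softmax normaliser.
        m = max(pos_logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in pos_logits))
        # Negative log-probability of the correct label at this position.
        total += log_z - pos_logits[target]
    return total / len(targets)


# Two positions, two-token vocabulary, uniform predictions.
loss = sequence_cross_entropy([[0.0, 0.0], [0.0, 0.0]], [0, 1])
print(loss)
```

With uniform predictions over two tokens, each position contributes log 2, so every position (including the trivial zeroth one) counts equally towards the mean.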
The model is a 2-layer transformer with 3 attention heads and causal attention. It includes layernorm, but no MLP layers.
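For readers less familiar with causal attention, here is a generic sketch of the masking step, assuming a square matrix of raw attention scores indexed as `scores[query][key]`. The function name `causal_attention_weights` is hypothetical, and this is not the model's own implementation; in the real model the scores come from query-key dot products inside each head.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax over scores[q][k], with keys k > q masked out."""
    weights = []
    for q, row in enumerate(scores):
        # A query at position q may only attend to keys 0..q.
        visible = row[: q + 1]
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        z = sum(exps)
        # Masked (future) positions receive exactly zero weight.
        weights.append([e / z for e in exps] + [0.0] * (len(row) - q - 1))
    return weights


w = causal_attention_weights([[1.0, 2.0, 3.0],
                              [0.0, 0.0, 5.0],
                              [0.0, 0.0, 0.0]])
print(w[0])  # [1.0, 0.0, 0.0]
```

The zeroth position can only attend to itself, which is one reason the prediction there cannot depend on anything later in the sequence.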
Note - although this model was trained for long enough to get loss close to zero (you can test this for yourself), it's not perfect. The model has some weaknesses which might make it vulnerable to adversarial examples, and I've decided to leave these in. The model is still very good at its intended task, and the main focus of this challenge is on figuring out how it solves the task, not dissecting the situations where it fails. However, you might find that the adversarial examples help you understand the model better.
Recommended material
Material equivalent to the following from the ARENA course is highly recommended:
[1.1] Transformer from scratch (sections 1-3)
[1.2] Intro to Mech Interp (sections 1-3)
The following material isn't essential, but is also recommended:
[1.2] Intro to Mech Interp (section 4)
If you want some guidance on how to get started, I'd recommend reading the solutions for the July problem - I expect there to be a lot of overlap in the best way to tackle these two problems. You can also reuse some of that code!
Motivation
Neel Nanda's post 200 COP in MI: Interpreting Algorithmic Problems does a good job explaining the motivation behind solving algorithmic problems such as these. I'd strongly recommend reading the whole post, because it also gives some high-level advice for approaching such problems.
The main purpose of these challenges isn't to break new ground in mech interp; rather, they're designed to help you practice using, and develop a better understanding of, standard MI tools (e.g. interpreting attention, direct logit attribution), and more generally to get comfortable working with libraries like TransformerLens.
Also, they're hopefully pretty fun, because why shouldn't we have some fun while we're learning?
Logistics
The solution to this problem will be published on the Streamlit in the first few days of September, at the same time as the next problem in the sequence. There will also likely be another LessWrong post.
If you attempt this challenge, you can send your attempt in any of the following formats:
Colab n...