February 18, 2023

AF - EIS VII: A Challenge for Mechanists by Stephen Casper

4 minutes

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EIS VII: A Challenge for Mechanists, published by Stephen Casper on February 18, 2023 on The AI Alignment Forum.

Part 7 of 12 in the Engineer’s Interpretability Sequence.

Thanks to Neel Nanda. I used some very nicely-written code of his from here. And thanks to both Chris Olah and Neel Nanda for briefly discussing this challenge with me.

MI = “mechanistic interpretability”

Given a network, recover its labeling function.

In the last post, I argued that existing works in MI focus on solving problems that are too easy. Here, I am posing a challenge for mechanists that is still a toy problem but one that is quite a bit less convenient than studying a simple model or circuit implementing a trivial, known task. To the best of my knowledge:

Unlike prior work on MI from the AI safety interpretability community, beating this challenge would be the first example of mechanistically explaining a network’s solution to a task that was not cherrypicked by the researcher(s) doing so.

Gaining a mechanistic understanding of the models in this challenge may be difficult, but it will probably be much less difficult than mechanistically interpreting highly intelligent systems in high stakes settings in the real world. So if an approach can’t solve the type of challenge posed here, it may not be very promising for doing much heavy lifting with AI safety work.

This post comes with a GitHub repository. Check it out here. The challenge is actually two challenges in one, and the basic idea is similar to some ideas presented in Lindner et al. (2023).

Challenge 1, MNIST CNN

I made up a nonlinear labeling function that labels approximately half of all MNIST images as 0’s and the other half as 1’s. Then I trained a small CNN on these labels, and it got 96% testing accuracy. The challenge is to use MI tools on the network to recover that labeling function.
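To make the setup concrete, here is a minimal sketch of how a labeling function of this kind could be constructed. Everything in it is an assumption for illustration: the rule in `hypothetical_label` is made up and is not the secret function from the challenge, and random arrays stand in for MNIST so the snippet runs without a download. It only shows the general shape of the idea: compute a nonlinear score from the pixels, then threshold it so that about half of the images get each label.

```python
import numpy as np


def hypothetical_label(images: np.ndarray) -> np.ndarray:
    """Assign a binary label to each image with a made-up nonlinear rule.

    This is NOT the challenge's labeling function; it is purely illustrative.
    images: float array of shape (n, 28, 28) with values in [0, 1].
    """
    top = images[:, :14, :].mean(axis=(1, 2))     # mean brightness of the top half
    bottom = images[:, 14:, :].mean(axis=(1, 2))  # mean brightness of the bottom half
    # A score that is nonlinear in the pixel values (assumed form).
    score = top * bottom - 0.5 * (top - bottom) ** 2
    # Thresholding at the median splits the dataset roughly in half.
    return (score > np.median(score)).astype(np.int64)


# Random stand-in data with MNIST's image shape (assumption: not real MNIST).
rng = np.random.default_rng(0)
images = rng.random((1000, 28, 28))
labels = hypothetical_label(images)
print("fraction labeled 1:", labels.mean())
```

A small CNN would then be trained on `(images, labels)` pairs in the usual way; the challenge is the reverse direction, recovering the rule from the trained weights alone.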

Hint 1: The labels are binary.

Hint 2: The network gets 95.58% accuracy on the test set.

Hint 3: This image may be helpful.

Challenge 2, Transformer

I made up a labeling function that takes in two integers from 0 to 113 and outputs either a 0 or 1. Then, using a lot of code from Neel Nanda’s grokking work, I trained a 1-layer transformer on half of the data. It then got 97% accuracy on the test half. As before, the challenge is to use MI tools to recover the labeling function.
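The data setup described here can be sketched as follows. This is a hedged illustration, not the code from the repository: the rule in `hypothetical_label` is invented, and `P = 114` assumes that "0 to 113" is meant inclusively. The sketch enumerates every pair of inputs, labels each pair, and splits the pairs at random into a training half and a test half.

```python
import numpy as np

P = 114  # assumption: "0 to 113" read as inclusive, so 114 values per input


def hypothetical_label(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Made-up binary rule on integer pairs; NOT the challenge's function."""
    return ((a * a + 3 * b) % 7 < 3).astype(np.int64)


# Enumerate every (a, b) pair and label it.
a, b = np.meshgrid(np.arange(P), np.arange(P), indexing="ij")
pairs = np.stack([a.ravel(), b.ravel()], axis=1)
labels = hypothetical_label(pairs[:, 0], pairs[:, 1])

# Random 50/50 split: the model sees one half and is evaluated on the other.
rng = np.random.default_rng(0)
perm = rng.permutation(len(pairs))
half = len(pairs) // 2
train_idx, test_idx = perm[:half], perm[half:]

print("train examples:", len(train_idx), "test examples:", len(test_idx))
```

Because the full input space is small enough to enumerate, the ground truth and the learned labels can both be drawn as a `P` by `P` grid, which is what makes a side-by-side comparison possible.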

Hint 1: The labels are binary.

Hint 2: The network is trained on 50% of examples and gets 97.27% accuracy on the test half.

Hint 3: Here are the ground truth and learned labels. Notice how the mistakes the network makes are all near curvy parts of the decision boundary...
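One way to check an observation like this numerically, given the two label grids, is sketched below. The function name `error_report` and the toy grids are assumptions for illustration. Note that the sketch measures how many mistakes sit directly on the decision boundary (a cell whose 4-neighbourhood contains a different true label); it does not measure curvature, so it is only a rough proxy for the pattern described in the hint.

```python
import numpy as np


def error_report(truth: np.ndarray, learned: np.ndarray):
    """Return accuracy, the mistake mask, and the share of mistakes on boundary cells."""
    mistakes = truth != learned
    # A cell is "on the boundary" if any of its 4 neighbours has a different true label.
    boundary = np.zeros_like(truth, dtype=bool)
    boundary[:-1, :] |= truth[:-1, :] != truth[1:, :]
    boundary[1:, :] |= truth[1:, :] != truth[:-1, :]
    boundary[:, :-1] |= truth[:, :-1] != truth[:, 1:]
    boundary[:, 1:] |= truth[:, 1:] != truth[:, :-1]
    accuracy = 1.0 - mistakes.mean()
    # max(..., 1) avoids dividing by zero when there are no mistakes.
    near = (mistakes & boundary).sum() / max(mistakes.sum(), 1)
    return accuracy, mistakes, near


# Toy grids (assumption): a quarter disc as ground truth, a slightly smaller one as "learned".
side = np.arange(50)
x, y = np.meshgrid(side, side, indexing="ij")
truth = (x**2 + y**2 < 30**2).astype(np.int64)
learned = (x**2 + y**2 < 29**2).astype(np.int64)

accuracy, mistakes, near = error_report(truth, learned)
print(f"accuracy={accuracy:.4f}, mistakes={mistakes.sum()}, on-boundary share={near:.2f}")
```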

Prizes

If you are the first person to send me the labeling function and a mechanistic explanation for either challenge, I will sing your praises on my Twitter, and I would be happy to help you write a post about how you solved a problem I thought would be very difficult. Neel Nanda and I are also offering a cash prize. (Thanks to Neel for offering to contribute to the pool!) Neel will donate $250, and I will donate $500 to a high-impact charity of choice for the first person to solve each challenge. That makes the total donation prize pool $1,500.

Good luck

For this challenge, I intentionally designed the labeling functions to not be overly simple. But I will not be too surprised if someone reverse-engineers them with MI tools, and if so, I will be extremely interested in how.

Neither of the models perfectly labels the validation set. One may object that this will make the problem unfairly difficult because if there is no convergence on the same behavior as the actual labeling function, then how is one supposed to find that function inside the model? This is kind of the point though. Real models that real engineers have to work with don’t tend to conveniently grok onto a simple, elegant, programmat...


By The Nonlinear Fund
