The Nonlinear Library

AF - EIS V: Blind Spots In AI Safety Interpretability Research by Stephen Casper


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EIS V: Blind Spots In AI Safety Interpretability Research, published by Stephen Casper on February 16, 2023 on The AI Alignment Forum.
Part 5 of 12 in the Engineer’s Interpretability Sequence.
Thanks to Anson Ho, Chris Olah, Neel Nanda, and Tony Wang for some discussions and comments.
TAISIC = “the AI safety interpretability community”
MI = “mechanistic interpretability”
Most AI safety interpretability work is conducted by researchers in a relatively small number of places, and TAISIC is closely connected by personal relationships and the AI Alignment Forum. Much of the community is focused on a few specific approaches like circuits-style MI, mechanistic anomaly detection, causal scrubbing, and probing. But this is a limited set of topics, and TAISIC might benefit from broader engagement. In the Toward Transparent AI survey (Räuker et al., 2022), we wrote 21 subsections of survey content. Only 1 was on circuits, and only 4 consisted in significant part of works from TAISIC.
I have often heard people in TAISIC explicitly advising more junior researchers not to focus much on reading the literature and instead to dive into projects. Obviously, experience working on projects is irreplaceable. But failing to understand the broader literature and community is a recipe for developing insularity and blind spots. I am quick to push back against advice that doesn’t emphasize the importance of engaging with outside work.
Within TAISIC, I have heard interpretability research described as dividing into two sets: mechanistic interpretability and, somewhat pejoratively, “traditional interpretability.” I will be the first to say that some paradigms in interpretability research are unproductive (see EIS III-IV). But I give equal emphasis to the importance of TAISIC not being too parochial. Reasons include maintaining relevance and relationships in the broader community, drawing useful inspiration from past works, making less-correlated bets in what we focus on, and, most importantly, not reinventing, renaming, and repeating work that has already been done outside of TAISIC.
TAISIC has reinvented, reframed, or renamed several paradigms
Mechanistic interpretability requires program synthesis, program induction, and/or programming language translation
“Circuits”-style MI is arguably the most popular and influential approach to interpretability in TAISIC. Doing this work requires iteratively (1) generating hypotheses for what a network is doing and then (2) testing how well those hypotheses explain its internal mechanisms. Step 2 may not be that difficult, and causal scrubbing (discussed below) seems like a type of solution that will be useful for it. But step 1 is hard. Mechanistic hypothesis generation is a lot like doing program synthesis, program induction, and/or programming language translation.
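To make the two steps concrete, here is a minimal sketch in Python of testing a mechanistic hypothesis by ablation. The toy network, the hypothesis mask, and all names here are hypothetical illustrations, and the check is a crude stand-in for (not an implementation of) causal scrubbing:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 2))

def toy_net(x, hidden_mask=None):
    """Two-layer ReLU net; hidden_mask optionally zero-ablates hidden units."""
    h = np.maximum(x @ W1, 0.0)
    if hidden_mask is not None:
        h = h * hidden_mask  # ablate the units the hypothesis calls irrelevant
    return h @ W2

# Step 1 (the hard part): hypothesize which hidden units implement the behavior.
claimed_relevant = np.array([1, 1, 0, 0, 0, 0, 0, 0], dtype=float)

# Step 2: test the hypothesis by ablating everything it calls irrelevant
# and measuring how much the outputs change on sample inputs.
X = rng.normal(size=(100, 4))
full = toy_net(X)
scrubbed = toy_net(X, hidden_mask=claimed_relevant)
print("mean output change under ablation:", np.abs(full - scrubbed).mean())
# A faithful hypothesis should yield a small change; a large one falsifies it.
```

The point is that step 2 reduces to a fairly mechanical check once a hypothesis exists; producing something like `claimed_relevant` for a real network in the first place is where the difficulty lies.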
Generating mechanistic hypotheses requires synthesizing programs to explain a network using its behavior and/or structure. If a method synthesizes programs based on the network’s task or I/O behavior, it is a form of program synthesis or induction. And if a method uses a network’s structure to write down a program that explains it, it is very similar to programming language translation.
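As a toy illustration of why hypothesis generation from I/O behavior amounts to program induction, the sketch below enumerates a small made-up DSL and keeps whichever programs are consistent with a stand-in “network’s” behavior. Everything here is hypothetical; real induction over real networks faces a vastly larger search space:

```python
def network(x, y):
    return max(x, y)  # pretend this is a trained network's I/O behavior

# A tiny DSL of candidate explanatory programs.
DSL = {
    "add": lambda x, y: x + y,
    "max": lambda x, y: max(x, y),
    "min": lambda x, y: min(x, y),
    "left": lambda x, y: x,
}

io_examples = [(1, 5), (3, 2), (-1, 0), (4, 4)]

# Enumerative induction: keep every program consistent with the I/O data.
consistent = [
    name for name, prog in DSL.items()
    if all(prog(x, y) == network(x, y) for x, y in io_examples)
]
print(consistent)  # ['max'] -- the induced "mechanistic hypothesis"
```

Even this four-program search is exhaustive enumeration, which hints at how badly the approach scales as the space of candidate programs grows.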
In general, program synthesis and program induction are very difficult and currently fail to scale to large problems. This is well understood, and these fields are mature enough that there are textbooks covering both the methods and their difficulty (e.g., Gulwani et al., 2017). Meanwhile, programming language translation is also very challenging. In practice, translating between common languages (e.g., Python and Java) is only partially automatable and relies on many hand-coded rules (Qiu, 1999), and using large language models has had only limited success (Roziere et al.). And in cases like ...