The Nonlinear Library: Alignment Forum

AF - Three ways interpretability could be impactful by Arthur Conmy


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Three ways interpretability could be impactful, published by Arthur Conmy on September 18, 2023 on The AI Alignment Forum.
Epistemic status: somewhat confident that overall these arguments show interpretability could be impactful, but not confident on any individual point.
Background: I wrote a draft of this post here after reading Charbel's work. Thanks to feedback from Charbel and several others, the post improved a lot; I would definitely recommend adversarial collaboration, so thanks Charbel! I'm hoping to discuss interpretability's impact further, so I'm posting now.
Three ways interpretability could be impactful
Just read the test and validation set parts if you have little time.
1. Test set interpretability: Finding alignment failures
What this is. We want to use interpretability to find alignment failures that other forms of auditing do not find. Before deploying a model, we find inputs that seem to elicit new capabilities. Either (A) the model has correctly generalized to novel capabilities, or (B) the model is deceiving our evaluation process to produce convincing but false outputs. We then use our future interpretability tools to try to understand some of the computation the AI is doing to reach these new capabilities. We either become more confident that the AI has correctly generalized, or we uncover deceptive computation and can retrain and/or provide robust evidence of misalignment to relevant actors.
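As a minimal sketch of this audit loop (purely illustrative: every function and field name below is a hypothetical placeholder I've chosen, and the interpretability tools that would do the real work in step 2 do not exist yet):

```python
# Illustrative sketch only: all names are hypothetical placeholders,
# and the "future interpretability tools" are stubbed out with toy logic.

def elicit_novel_capability_inputs(model, evaluation_suite):
    """Step 1 (placeholder): find inputs that seem to elicit new capabilities."""
    return [x for x in evaluation_suite if model(x)["surprisingly_capable"]]

def interpret_computation(model, x):
    """Step 2 (placeholder): future tools would explain the computation behind model(x)."""
    return {"input": x, "deceptive_circuit_found": False}

def audit_before_deployment(model, evaluation_suite):
    """Step 3: either gain confidence the model generalized correctly, or
    surface deceptive computation so we can retrain and/or report it."""
    for x in elicit_novel_capability_inputs(model, evaluation_suite):
        explanation = interpret_computation(model, x)
        if explanation["deceptive_circuit_found"]:
            return {"verdict": "possible deception", "evidence": explanation}
    return {"verdict": "no deception found by these tools"}

# Toy usage: a fake "model" that looks surprisingly capable on one input.
toy_model = lambda x: {"surprisingly_capable": x == "novel task"}
print(audit_before_deployment(toy_model, ["known task", "novel task"]))
```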
Why this isn't achievable with other methods. If a model can deceive our evaluation process, it could be very difficult to observe this deception with mere behavioral evals. See here.
In my mind, test set interpretability primarily targets a specific set of alignment failures, illustrated in Figure 1:
Figure 1.
In the appendix I outline my reasoning behind (interpretability's role in) Figure 1.
2. Validation set interpretability: A better science of alignment
What this is (thanks to Neel). We have little ground truth on whether our models are misaligned now, or on how far methods such as RLHF will continue to scale. More generally, we understand little about how machine learning works, which limits our ability to reason about future systems. Interpretability could, first and foremost, actually provide evidence for what our alignment techniques are doing (e.g. interpreting RLHF reward models), and second, give us a better evidence base for reasoning about deep learning. I think that Progress Measures for Grokking Via Mechanistic Interpretability has already somewhat changed people's perspectives on how ML models select different algorithms (e.g. here, here).
This differs from test set interpretability in that it is broader and can be applied before testing potentially misaligned models, to steer the field towards better practices for alignment (Russell and Norvig's validation/test distinction here may be a helpful analogy).
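For readers less familiar with that analogy, here is a small sketch of the standard validation/test split it borrows from (my illustration, not code from the post, with a toy scoring function standing in for training): the validation split is consulted repeatedly to steer choices during development, roughly as validation set interpretability would steer alignment practices, while the test split is a single held-out check at the end, roughly as test set interpretability audits a specific trained model before deployment.

```python
# Standard ML validation/test usage, shown only to unpack the analogy.
import random

def toy_score(setting, split):
    """Toy stand-in for 'train with this setting and score it on split'."""
    return abs(setting - 0.01) + 0.0 * len(split)

random.seed(0)
data = list(range(100))
random.shuffle(data)
train, validation, test = data[:70], data[70:85], data[85:]

# Validation-style use: consulted many times to steer decisions during development.
best_setting = min([0.1, 0.03, 0.01], key=lambda s: toy_score(s, validation))

# Test-style use: a single final check on held-out data.
final_score = toy_score(best_setting, test)
print(best_setting, final_score)
```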
Why this isn't achievable with other methods. If we want to understand how models work for safety-relevant end goals, it seems likely to me that interpretability is the best research direction to pursue. Most methods are merely behavioral and so provide limited ground truth, especially when we are uncertain about deception. For example, I think the existing work trying to make chain-of-thought faithful shows that naive prompting is likely insufficient to understand models' reasoning. Non-behavioral methods such as science-of-deep-learning approaches (e.g. singular learning theory, scaling laws) by default give high-level descriptions of neural network statistics such as loss or RLCT. I don't think these approaches are as likely to get close to answering questions about AIs' internal computations as successful interpretability could. I think some other directions are worthwhile bets due to uncertainty and neglectedness, howe...