AF - Impact stories for model internals: an exercise for interpretability researchers by Jenny Nitishinskaya


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Impact stories for model internals: an exercise for interpretability researchers, published by Jenny Nitishinskaya on September 25, 2023 on The AI Alignment Forum.
Inspired by Neel's longlist; thanks to @Nicholas Goldowsky-Dill and @Sam Marks for feedback and discussion, and thanks to AWAIR attendees for participating in the associated activity.
As part of the Alignment Workshop for AI Researchers in July/August '23, I ran a session on theories of impact for model internals. Many of the attendees were excited about this area of work, and we wanted an exercise to help them think through what exactly they were aiming for and why. This write-up came out of planning for the session, though I didn't use all this content verbatim. My main goal was to find concrete starting points for discussion, which
have the right shape to be a theory of impact
are divided up in a way that feels natural
cover the diverse reasons why people may be excited about model internals work
(according to me).
This isn't an endorsement of any of these, or of model internals research in general. The ideas on this list are due to many people, and I cite things sporadically when I think it adds useful context: feel free to suggest additional citations if you think it would help clarify what I'm referring to.
Summary of the activity
During the session, participants identified which impact stories seemed most exciting to them. For a couple of those items, we discussed why they felt excited, what success might look like concretely, how it might fail, what other ideas are related, and so on. I think categorizing existing work based on its theory of impact could also be a good exercise in the future.
I personally found the discussion useful for helping me understand what motivated some of the researchers I talked to. I was surprised by the diversity.
Key stats of an impact story
Applications of model internals vary a lot along multiple axes:
Level of human understanding needed for the application
If a lot of human understanding is needed, does that update you on the difficulty of executing in this direction? If understanding is not needed, does that open up possibilities for non-understanding-based methods you hadn't considered?
For example, determining whether the model does planning would probably require understanding. On the other hand, finding adversarial examples or eliciting latent knowledge might not involve any; a minimal probe sketch at the end of this section illustrates the understanding-free case.
Level of rigor or completeness (in terms of % model explained) needed for the application
If a high level of rigor or completeness is needed, does that update you on the difficulty of executing in this direction? What does the path to high rigor/completeness look like? Can you think of modifications to the impact story that might make partial progress be more useful?
For example, we get value out of finding adversarial examples or dangerous capabilities, even if the way we find them is somewhat hacky. Meanwhile, if we don't find them, we'd need to be extremely thorough to be sure they don't exist, or sufficiently rigorous to get a useful bound on how likely the model is to be dangerous.
Is using model internals essential for the application, or are there many possible approaches to the application, only some of which make use of model internals?
Steering model behaviors can be done via model editing, or by prompting or finetuning; but there are reasons (mentioned below) why editing could be a better approach. A minimal sketch of activation-level steering appears at the end of this section.
Many impact stories (at least as I've categorized them) have variants that live at multiple points on these spectra. When thinking about one, you should think about where it lands, and what variants you can think of that might be e.g. easier but still useful.
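To make the "no understanding needed" end of the first axis concrete, here is a minimal probe sketch. It is not from the original post: the activations, labels, and dimensions are random stand-ins I am assuming for illustration. The point is only that nothing in this workflow requires interpreting what the activation dimensions mean; the probe either generalizes to held-out examples or it doesn't.

```python
# Toy illustration: hidden activations used as opaque features for a linear probe,
# with no attempt to understand what the individual dimensions represent.
# The arrays below are random stand-ins; in practice, `activations` would hold one
# hidden-state vector per example (e.g. per statement), and `labels` the ground
# truth for the property being probed.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
activations = rng.normal(size=(200, 768))  # stand-in for real model activations
labels = rng.integers(0, 2, size=200)      # stand-in for ground-truth labels

probe = LogisticRegression(max_iter=1000).fit(activations[:150], labels[:150])
print("held-out probe accuracy:", probe.score(activations[150:], labels[150:]))
```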
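And here is a minimal sketch of the "steering via model editing" option from the third axis, as opposed to prompting or finetuning. Everything here is an assumption chosen for illustration (GPT-2 through Hugging Face transformers, the layer, the contrast prompts, and the coefficient); it is a toy activation-addition sketch, not a method described in the post.

```python
# Toy illustration of steering a behavior by editing internal activations.
# Assumptions: GPT-2 via Hugging Face transformers, an arbitrary layer, an
# arbitrary contrast pair of prompts, and an arbitrary steering coefficient.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

LAYER = 6    # which transformer block's output to edit (arbitrary choice)
COEFF = 4.0  # how strongly to steer (arbitrary choice)

def block_output(prompt, layer=LAYER):
    """Residual-stream activations output by block `layer` for `prompt`."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden_states = model(ids, output_hidden_states=True).hidden_states
    return hidden_states[layer + 1]  # index 0 is the embedding output

# Build a crude steering vector from a contrast pair of prompts.
steer = (block_output("I love this").mean(dim=1)
         - block_output("I hate this").mean(dim=1)).squeeze(0)

def steering_hook(module, inputs, output):
    # GPT-2 blocks return a tuple; the first element is the hidden states.
    return (output[0] + COEFF * steer,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
try:
    ids = tokenizer("The movie was", return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=20, do_sample=False)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()
```

If the steering vector captures the intended direction, generations shift accordingly; the behavior change comes from editing the residual stream directly rather than from changing the prompt or the weights.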
The list
Some are more fleshed out than others; some of the rest could be fleshed out with a bit more effort, while others are more ...