The Nonlinear Library

AF - 200 COP in MI: Interpreting Reinforcement Learning by Neel Nanda



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: 200 COP in MI: Interpreting Reinforcement Learning, published by Neel Nanda on January 10, 2023 on The AI Alignment Forum.
This is the ninth post in a sequence called 200 Concrete Open Problems in Mechanistic Interpretability. Start here, then read in any order. If you want to learn the basics before you think about open problems, check out my post on getting started. Look up jargon in my Mechanistic Interpretability Explainer.
I’ll make another post every 1-2 days, giving a new category of open problems. If you want to read ahead, check out the draft sequence here!
Motivating papers: Acquisition of Chess Knowledge in AlphaZero, Understanding RL Vision
Disclaimer: My area of expertise is language model interpretability, not reinforcement learning. These are fairly uninformed takes and I’m sure I’m missing a lot of context and relevant prior work, but this is hopefully more useful than nothing!
Motivation
Reinforcement learning is the study of how to create agents - models that can act in an environment and form strategies to get high reward. I think we have a lot of deep confusions about how RL systems work and how they learn, and that trying to reverse engineer these systems could teach us a lot! Some high-level questions about RL that I'd love to get clarity on:
How much do RL agents learn to actually model their environment vs just being a big bundle of learned heuristics?
In particular, how strong a difference is there between model-free and model-based systems here?
Do models explicitly reason about their future actions? What would a circuit for this even look like?
What are the training dynamics of an RL system like? What does its path to an eventual solution look like? A core difficulty in training RL systems is that the best strategies tend to require multi-step plans, but taking the first step only makes sense if you know that you'll execute the next few steps correctly.
What does it look like on a circuit level when a model successfully explores and finds a new and better solution?
How much do the lessons from mechanistic interpretability transfer? Can we even reverse engineer RL systems at all?
How well do our conceptual frameworks for thinking about networks transfer to RL? Are there any things that clearly break, or new confusions that need to be grappled with?
I expect training dynamics to be a particularly thorny issue - rather than learning from a fixed data distribution, the model faces a moving target: whether an action is a good idea depends on which actions the model will take in the future.
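To make that non-stationarity concrete, here is a minimal sketch of the standard tabular Q-learning update (the environment sizes and hyperparameters are purely illustrative, not from the post). The bootstrapped target uses the agent's own current value estimates, so the effective training signal shifts as the policy improves:

```python
import numpy as np

# Toy tabular Q-learning update, illustrating why RL training is
# non-stationary: the target bootstraps off the agent's *current* value
# estimates, so the training signal shifts as the policy improves.
# Sizes and hyperparameters below are illustrative only.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99  # learning rate, discount factor

def q_update(s, a, r, s_next):
    # Whether (s, a) looks good depends on the value of the best action the
    # agent expects to take at s_next - i.e. on its own future behaviour.
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])
```

The same bootstrapping structure shows up in deep RL (e.g. the TD target in DQN), which is part of why I'd expect the learning dynamics to look quite different from supervised training on a fixed dataset.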
Further, I think it's fairly likely that, whatever AGI looks like, it will involve a significant component of RL, and that training human-level and beyond systems to find creative solutions that optimise some reward is particularly likely to result in dangerous behaviour - e.g. where models learn instrumental goals such as seeking power or deceiving us, or more generally learn to represent and pursue sophisticated long-term goals and to successfully plan towards them.
But as it stands, much alignment work here is fairly speculative and high-level, or focused on analysing model behaviour and drawing inferences from it. I would love to see this grounded in a richer understanding of what is actually going on inside the system. Some more specific questions around alignment that I'm keen to get clarity on:
Do models actually learn internal representations of goals? Can we identify these?
I’m particularly interested in settings with robust capabilities but misgeneralised goals - what’s going on here? Can we find any evidence for or against inner optimisation concerns?
How much are models actually forming plans internally? Can we identify this?
Do models ever simulate other agents? (Especially in a multi-agent setting)
A work I find particularly inspiring here is Tom McGrath et al.'s Acquisition of Chess Knowledge in AlphaZero.
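That paper's central technique is probing: training simple classifiers on the network's internal activations to test whether human-interpretable concepts are linearly represented. As a hedged sketch of how the same idea might apply to the goal-representation questions above - the array names, shapes, and labels here are hypothetical stand-ins, not from the paper:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical setup: `activations` holds one layer's hidden states for N
# agent timesteps, and `goal_labels` marks a goal-relevant property of each
# state (e.g. which objective the agent is currently pursuing). Random data
# stands in for activations you would extract from a real agent.
rng = np.random.default_rng(0)
activations = rng.normal(size=(1000, 256))   # (N, d_model)
goal_labels = rng.integers(0, 2, size=1000)  # (N,) binary goal label

# Fit a linear probe on a train split; accuracy above chance on held-out
# data is (weak) evidence that the layer linearly encodes goal information.
split = 800
probe = LogisticRegression(max_iter=1000)
probe.fit(activations[:split], goal_labels[:split])
print("held-out probe accuracy:", probe.score(activations[split:], goal_labels[split:]))
```

Note that high probe accuracy only shows the information is linearly decodable from the activations, not that the agent actually uses it; causal interventions (e.g. activation patching) are needed for that stronger claim.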