The Nonlinear Library

AF - Why almost every RL agent does learned optimization by Lee Sharkey


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Why almost every RL agent does learned optimization, published by Lee Sharkey on February 12, 2023 on The AI Alignment Forum.
Or "Why RL≈RL2 (And why that matters)"
TL;DR: This post discusses the blurred conceptual boundary between RL and RL2 (also known as meta-RL). RL2 is an instance of learned optimization. I point out that, far from being a special case, the conditions under which RL2 emerges are actually the default conditions for RL training. I argue that this is safety-relevant by outlining the evidence that learned planning algorithms will probably emerge -- and have probably already emerged in a weak sense -- in scaled-up RL2 agents.
I've found myself telling this story about the relationship between RL and RL2 numerous times in conversation. When that happens, it's usually time to write a post about it. Most of the first half of the post (which points out that RL2 is probably more common than most people think) makes points that are probably already familiar to people who've thought a bit about inner alignment.
The last section of the post (which outlines why learned planning algorithms will probably emerge from scaled up RL2 systems) contains arguments that may be less widely appreciated among inner alignment researchers, though I still expect the arguments to be familiar to some.
Background on RL2
RL2 (Duan et al. 2016), also known as meta-RL (Wang et al. 2016; Beck et al. 2023), is the phenomenon where an RL agent learns to implement another RL algorithm in its internal activations. It's the RL version of 'learning to learn by gradient descent', a kind of meta-learning first described in the supervised setting by Hochreiter et al. (2001). These days, in language models it's often called 'in-context learning' (Olsson et al. 2022; Garg et al. 2022).
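To make the setup concrete, here is a minimal sketch, in PyTorch with illustrative names (my own simplification, not code from any of the cited papers), of the kind of agent interface these RL2/meta-RL papers describe: a recurrent policy whose input at each step includes the previous action and previous reward, so that any fast, within-trial adaptation has to be carried by the recurrent hidden state rather than by the weights.

    import torch
    import torch.nn as nn

    class RL2Policy(nn.Module):
        """Recurrent policy in the style of RL2/meta-RL (hypothetical, simplified)."""
        def __init__(self, obs_dim: int, n_actions: int, hidden_dim: int = 128):
            super().__init__()
            # Input = current observation + one-hot previous action + previous reward.
            self.rnn = nn.GRUCell(obs_dim + n_actions + 1, hidden_dim)
            self.policy_head = nn.Linear(hidden_dim, n_actions)
            self.value_head = nn.Linear(hidden_dim, 1)

        def forward(self, obs, prev_action_onehot, prev_reward, hidden):
            # obs: (batch, obs_dim); prev_action_onehot: (batch, n_actions);
            # prev_reward: (batch, 1); hidden: (batch, hidden_dim)
            x = torch.cat([obs, prev_action_onehot, prev_reward], dim=-1)
            hidden = self.rnn(x, hidden)        # memory persists across steps and episodes
            logits = self.policy_head(hidden)   # action distribution for this step
            value = self.value_head(hidden)     # baseline for the outer RL update
            return logits, value, hidden

The outer RL algorithm (e.g., an actor-critic update) only adjusts the weights between rollouts; anything the agent figures out within a trial has to be computed by the forward pass and stored in the hidden state, which is what makes the inner procedure a learned algorithm rather than a hand-written one.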
RL2 is interesting from a safety perspective because it's a form of learned optimization (Hubinger et al. 2019): the RL algorithm (the outer optimization algorithm) trains the weights of an agent, which learns to implement a separate, inner RL algorithm (the learned optimization algorithm).
The inner RL algorithm gives the agent the ability to adapt its policy to a particular task instance from the task distribution on which it is trained. Empirically, agents trained to exhibit RL2 show rapid adaptation and zero-shot generalization to new tasks (DeepMind Adaptive Agent team et al. 2023), hypothesis-driven exploration/experimentation (DeepMind Open-Ended Learning Team et al. 2021), and causal reasoning (Dasgupta et al. 2019). RL2 may even underlie human planning, decision-making, social cognition, and moral judgement, since there is compelling evidence that the human prefrontal cortex (the area of the brain most associated with those capabilities) implements an RL2 system (Wang et al. 2018). These cognitive capabilities are the kinds of things we're concerned about in powerful AI systems. RL2 is therefore a phenomenon that seems likely to underlie some major safety risks.
The conditions under which RL2 emerges are the default RL training conditions
Ingredients for an RL2 cake
The four 'ingredients' required for RL2 to emerge are as follows (a code sketch after the list shows where each one appears in a training loop):
The agent must have observations that correlate with reward.
The agent must have observations that correlate with its history of actions.
The agent must have a memory state that persists through time in which the RL2 algorithm can be implemented.
The agent must be trained on a distribution of tasks.
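Here is a minimal sketch, in plain Python, of a rollout loop that contains all four ingredients. The environment interface and the policy methods (initial_state, step) are simplified, hypothetical stand-ins rather than any particular library's API.

    import random

    def collect_trial(policy, task_distribution, episodes_per_trial=3):
        env = random.choice(task_distribution)  # ingredient 4: train over a distribution of tasks
        hidden = policy.initial_state()         # ingredient 3: memory state that persists through time
        prev_action, prev_reward = None, 0.0
        trajectory = []
        for _ in range(episodes_per_trial):
            obs, done = env.reset(), False      # note: hidden is NOT reset between episodes
            while not done:
                # Ingredients 1 and 2: the policy's inputs carry information that correlates
                # with reward (prev_reward, obs) and with its own action history (prev_action,
                # obs), so the hidden state can come to encode what has worked in this task.
                action, hidden = policy.step(obs, prev_action, prev_reward, hidden)
                obs, reward, done = env.step(action)
                trajectory.append((obs, action, reward, done))
                prev_action, prev_reward = action, reward
        return trajectory  # the outer RL algorithm updates the weights using trials like this

Note that nothing in this loop is exotic: reward-correlated and action-correlated observations, persistent recurrent memory, and variation across environment instances are standard features of RL training setups, which is the sense in which these are default conditions rather than a special case.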
These conditions let the agent learn an RL2 algorithm because, together, they allow it to learn to adapt its actions to a particular task instance according to what led to reward. Here's a more detailed picture of the mechanism by which these ingredients lead to RL2:
Thanks to (1), agents tend to learn representations that identify if the agent is getting closer to valuable states.
Thanks to...