Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Think carefully before calling RL policies "agents", published by Alex Turner on June 2, 2023 on The AI Alignment Forum.
I think agentic systems represent most of AI extinction risk. I want to think clearly about what training procedures produce agentic systems. Unfortunately, the field of reinforcement learning has a convention of calling its trained artifacts "agents." This terminology is loaded and inappropriate for my purposes. I advocate instead calling the trained system a "policy." This name is standard, accurate, and neutral.
Don't assume the conclusion by calling a policy an "agent"
The real-world systems we want to think about and align are very large neural networks like GPT-4. These networks are trained and finetuned via different kinds of self-supervised and reinforcement learning.
When a policy network is updated using a learning process, its parameters are changed via weight updates. Eventually, the process ends (assuming no online learning for simplicity). We are then left with a policy network (e.g. GPT-4). To actually use the network, we need to use some sampling procedure on its logits (e.g. top-p with a given temperature). Once we fix the policy network and sampling procedure, we get a mapping from observations (e.g. sequences of tokens like [I, love, dogs]) to probability distributions over outputs (e.g. tokens). This mapping π is the policy.
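As a minimal sketch of the construction above, the following Python composes a fixed "network" with a fixed sampling procedure to obtain a policy π, a mapping from observations to probability distributions over tokens. Everything named here is an assumption made for illustration: `toy_logits` is a hypothetical stand-in for a trained network, `VOCAB` is a made-up three-token vocabulary, and `make_policy` is not from any library.

```python
import math
from typing import Callable, Dict, List, Sequence

Observation = Sequence[str]
LogitFn = Callable[[Observation], List[float]]


def softmax(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities at the given temperature (> 0)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs: List[float], top_p: float) -> List[float]:
    """Keep the most likely tokens until their mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


def make_policy(
    logit_fn: LogitFn, vocab: List[str], temperature: float = 1.0, top_p: float = 0.9
) -> Callable[[Observation], Dict[str, float]]:
    """Fix the network and the sampling procedure; return the resulting policy."""

    def policy(observation: Observation) -> Dict[str, float]:
        probs = softmax(logit_fn(observation), temperature)
        return dict(zip(vocab, top_p_filter(probs, top_p)))

    return policy


VOCAB = ["dogs", "cats", "rain"]  # hypothetical vocabulary


def toy_logits(observation: Observation) -> List[float]:
    """Hypothetical stand-in for a trained network's forward pass."""
    return [2.0, 1.0, -1.0] if "love" in observation else [0.0, 0.0, 0.0]


pi = make_policy(toy_logits, VOCAB, temperature=1.0, top_p=0.9)
dist = pi(["I", "love", "dogs"])  # a probability distribution over VOCAB
```

Nothing in `pi` says anything about planning or goals. It is only the input-output mapping, which is why "policy" is a neutral description of the artifact.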
I want to carefully consider whether a trained policy will exhibit agentic cognition of various forms, including planning, goal-directedness, and situational awareness. While considering this question, we should not start calling the trained policy an "agent"! That's like a detective randomly calling one of the suspects "criminal." I prefer just calling the trained artifact a "policy." This neutrally describes the artifact's function, without connoting agentic or dangerous cognition.
Of course, a policy could in fact be computed using internal planning (e.g. depth-3 heuristic search) to achieve an internally represented goal (e.g. number of diamonds predicted to be present). I think it's appropriate to call that kind of computation "agentic." But that designation is only appropriate after further information is discovered (e.g. how the policy in fact works).
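To make the distinction concrete, here is a toy sketch in which two policies compute the same mapping in different ways: one by depth-3 search against an internally represented goal, the other by a plain lookup table. The five-cell world, the diamond location, and all function names are hypothetical and chosen only for illustration.

```python
from typing import Dict

ACTIONS = ("left", "right")
DIAMONDS = {3}  # hypothetical world: cells 0..4 on a line, one diamond cell


def step(pos: int, action: str) -> int:
    """World model: move one cell left or right, clamped to [0, 4]."""
    return max(0, min(4, pos + (1 if action == "right" else -1)))


def predicted_diamonds(pos: int, depth: int) -> int:
    """Most diamond-cell visits achievable in `depth` further moves."""
    if depth == 0:
        return 0
    return max(
        int(step(pos, a) in DIAMONDS) + predicted_diamonds(step(pos, a), depth - 1)
        for a in ACTIONS
    )


def planning_policy(pos: int, depth: int = 3) -> Dict[str, float]:
    """Policy computed by depth-limited search toward an explicit goal."""
    values = {
        a: int(step(pos, a) in DIAMONDS) + predicted_diamonds(step(pos, a), depth - 1)
        for a in ACTIONS
    }
    best = max(values.values())
    winners = [a for a in ACTIONS if values[a] == best]
    # Spread probability uniformly over the actions tied for best.
    return {a: (1.0 / len(winners) if a in winners else 0.0) for a in ACTIONS}


# The same mapping stored as a table, with no search and no represented goal.
TABLE: Dict[int, Dict[str, float]] = {
    0: {"left": 0.0, "right": 1.0},
    1: {"left": 0.0, "right": 1.0},
    2: {"left": 0.0, "right": 1.0},
    3: {"left": 0.5, "right": 0.5},
    4: {"left": 1.0, "right": 0.0},
}


def table_policy(pos: int) -> Dict[str, float]:
    """Policy computed by lookup alone."""
    return TABLE[pos]


same_mapping = all(planning_policy(p) == table_policy(p) for p in range(5))
```

In this toy world the two policies cannot be told apart from their outputs alone. Whether the word "agentic" fits depends on how the mapping is computed, which has to be discovered by looking inside.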
There's no deep reason why trained policies are called "agents"
Throughout my PhD in RL theory, I accepted the idea that RL tends to create agents, and supervised learning doesn't. Well-cited papers use the term "agents", as do textbooks and Wikipedia. I also hadn't seen anyone give the pushback I give in this post.
Question: Given a fixed architecture (e.g. a 48-layer decoder-only transformer), what kinds of learning processes are more likely to train policy networks which use internal planning?
If you're like I was in early 2022, you might answer "RL trains agents." But why? In what ways do PPO's weight updates tend to accumulate into agentic circuitry, while unsupervised pretraining on OpenWebText does not?
Claim: People are tempted to answer "RL" because the field adopted the "agent" terminology for reasons unrelated to the above question. Everyone keeps using the loaded terminology because no one questions it.
Let's be clear. RL researchers did not deliberate carefully about the inductive biases of deep learning, and then decide that a certain family of algorithms was especially likely to train agentic cognition. Researchers called policies "agents" as early as 1995, before the era of deep learning (e.g. see AI: A Modern Approach, 1st edition).
Does RL actually produce agents?
The fact that "agents" was chosen for reasons unrelated to agentic cognition doesn't mean the name is inappropriate. I can think of a few pieces of evidence for RL entraining agentic cognition.
RL methods are often used to train networks on tasks like video games and robotics....