Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Mode collapse in RL may be fueled by the update equation, published by Alex Turner on June 19, 2023 on The AI Alignment Forum.
TL;DR: We present an advantage variant which, in certain settings, does not train an optimal policy, but instead uses a fixed reward to update a policy a fixed amount from initialization. Non-tabular empirical results seem mixed: The policy doesn't mode-collapse, but has unclear convergence properties.
Summary: Many policy gradient methods allow a network to extract arbitrarily many policy updates from a single kind of reinforcement event (e.g. for outputting tokens related to weddings). Alex proposes a slight modification to the advantage equation, called "action-conditioned TD error" (ACTDE). ACTDE ensures that the network doesn't converge to an "optimal" policy (these almost always put infinite logits on a single action). Instead, ACTDE updates the network by a fixed number of logits.
For example, suppose R(pizza)=10 and R(cookies)=11. In this case, PPO converges to a policy which puts arbitrarily many logits on cookies, even though the reward difference is small. By contrast, under ACTDE, the network converges to the softmax-over-reward policy {pizza: 27%, cookies: 73%}, which seems more reasonable.
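The quoted policy can be checked directly. A minimal sketch, assuming a temperature-1 (base-e) softmax over the raw rewards:

```python
import math

# Softmax-over-reward policy for R(pizza)=10, R(cookies)=11.
rewards = {"pizza": 10.0, "cookies": 11.0}
z = sum(math.exp(r) for r in rewards.values())
policy = {a: math.exp(r) / z for a, r in rewards.items()}

print(policy)  # pizza ≈ 0.269, cookies ≈ 0.731
```

Because only the reward difference matters under softmax, the 1-point gap yields roughly 27%/73% no matter how large the rewards themselves are.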
Then, Michael Einhorn shares initial results which support Alex's theoretical predictions. Using a similar architecture and Q-head loss function to ILQL for a small transformer trained in a prisoner's dilemma, Michael Einhorn collected initial data on ACTDE. Unlike PPO, ACTDE-trained policies did not mode collapse onto a single action and instead learned mixed strategies.
We're interested in additional experiments on ACTDE. We hope that, by using ACTDE instead of advantage, we can automatically mitigate "reward specification" issues and maybe even reduce the need for a KL penalty term. That would make it easier to shape policies which do what we want.
The advantage equation implies arbitrary amounts of update on a single experience
In PPO, the optimization objective is proportional to the advantage given a policy π, reward function R, and on-policy value function vπ: Aπ(s,a) = Es′[R(s,a,s′) + γvπ(s′) − vπ(s)].
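In the one-state bandit setting used below, the next state is terminal, so the advantage reduces to the reward minus the current value estimate. A small sketch (the function name and defaults are illustrative, not from the post):

```python
# One-step advantage A(s,a) = r + gamma*v(s') - v(s).
# In a bandit, the next state is terminal, so v_next defaults to 0.
def advantage(r, v_s, v_next=0.0, gamma=1.0):
    return r + gamma * v_next - v_s

print(advantage(r=1.0, v_s=0.0))  # "wedding" from initialization: +1.0
print(advantage(r=0.5, v_s=1.0))  # "party" after the value snaps to 1: -0.5
```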
Alex thinks this equation is actually pretty messed up, although it looked decent at first. The problem is that this advantage can oscillate forever. To explain, let's consider a simple bandit problem—one state ("We had a") and two actions ("wedding" and "party") with rewards R(“We had a wedding”)=1 and R(“We had a party”)=.5.
The failure unfolds as follows:
The policy tries out the "wedding" action, receives strong reinforcement of R=1, and increases its logits on that action because its advantage was positive. The policy learns that the state's value is high (vπ(s)=1).
The policy eventually tries out the "party" action, receiving less reinforcement at R=.5, decreasing the logits on "party" (because its advantage was negative). The policy learns that the original state's value is low (vπ(s)=.5).
The policy tries out "wedding" again, receives positive advantage relative to the low original state value. The logits go up on "wedding", and the value is once again high (vπ(s)=1).
This continues to happen, which means that "wedding" gets arbitrarily high logits.
This flaw is easiest to see formally. Initialize the t=0 tabular value function vπ0 to 0, and the policy π0 to be 50/50 for “party”/“wedding”. Let γ=1, and we update the value function v using tabular TD learning (with learning rate α=1). So, for example, if the system takes the “wedding” action, its new value function vπ1(s)=1. If the system then takes the “party” action, the value snaps back to vπ2(s)=.5.
The policy update rule is: If the advantage Aπ(s,a)=n, then action a becomes n bits more probable under π (i.e. we add n to π's logits on a). So, if π0(s,“ wedding”)=.5 and advantage Aπ0(s,“ wedding")=1, then π1(s,“ wedding”)=2/3.
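The oscillation above can be simulated directly. A minimal sketch of the tabular dynamics just described (α=1, γ=1, logits in bits); deterministically alternating the two actions is an assumption made here to surface the cycle:

```python
# Tabular simulation of the "wedding"/"party" bandit.
R = {"wedding": 1.0, "party": 0.5}
v = 0.0                                    # tabular value estimate v_pi(s)
logits = {"wedding": 0.0, "party": 0.0}    # 50/50 initial policy, in bits

def prob(action):
    # Logits are in bits, so probabilities use base-2 exponentiation.
    z = sum(2 ** l for l in logits.values())
    return 2 ** logits[action] / z

for step in range(10):                     # 5 wedding/party cycles
    action = "wedding" if step % 2 == 0 else "party"
    adv = R[action] - v                    # gamma=1, terminal next state
    logits[action] += adv                  # advantage of n adds n bits
    v = R[action]                          # TD update with alpha=1

print(logits)          # wedding: 3.0 bits, party: -2.5 bits
print(prob("wedding"))  # ≈ 0.978 and still climbing
```

Each full cycle adds +0.5 bits to "wedding" and −0.5 bits to "party", so the gap grows without bound, which is the arbitrary-logit accumulation the post describes.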
Episod...