Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Reward is the optimization target (of capabilities researchers), published by Max H on May 15, 2023 on The AI Alignment Forum.
In "Reward is not the optimization target", @TurnTrout writes:
Therefore, reward is not the optimization target in two senses:
Deep reinforcement learning agents will not come to intrinsically and primarily value their reward signal; reward is not the trained agent’s optimization target.
Utility functions express the relative goodness of outcomes. Reward is not best understood as being a kind of utility function. Reward has the mechanistic effect of chiseling cognition into the agent's network. Therefore, properly understood, reward does not express relative goodness and is therefore not an optimization target at all.
I think these claims are true and important, though in my own terms, I would rephrase and narrow them as:
Executing a policy trained through current reinforcement learning methods does not necessarily result in a system which takes actions to maximize the reward function.
I find the claim stated this way more intuitively obvious, but also somewhat less interesting, because the restatement makes the narrowness more explicit.
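As a toy illustration of the narrowed claim, here is a minimal Python sketch. It is not taken from the original post and is not representative of SotA deep RL. It assumes a hypothetical two-armed bandit with deterministic rewards and a purely greedy value learner with no exploration; the names `ARM_REWARDS`, `reward_fn`, `train_greedy`, and `make_policy` are invented for this example. The point is only that the executed policy is a frozen mapping that never consults the reward function, and that it can settle on an action which does not maximize reward.

```python
from typing import Callable, List

# Hypothetical environment: arm 0 always pays 1.0, arm 1 always pays 10.0.
ARM_REWARDS: List[float] = [1.0, 10.0]


def reward_fn(action: int) -> float:
    """Reward signal, used only during training."""
    return ARM_REWARDS[action]


def train_greedy(num_steps: int = 100, lr: float = 0.5) -> List[float]:
    """Train value estimates with a purely greedy learner (no exploration)."""
    q = [0.0 for _ in ARM_REWARDS]
    for _ in range(num_steps):
        # max() returns the first maximal element, so ties go to the lowest index.
        action = max(range(len(q)), key=lambda a: q[a])
        # Reward shapes the estimates; it is not stored as a goal anywhere.
        q[action] += lr * (reward_fn(action) - q[action])
    return q


def make_policy(q: List[float]) -> Callable[[], int]:
    """Freeze the learned estimates into a policy that never sees reward_fn."""
    frozen = list(q)

    def policy() -> int:
        return max(range(len(frozen)), key=lambda a: frozen[a])

    return policy


q_values = train_greedy()
policy = make_policy(q_values)
chosen = policy()
best = max(range(len(ARM_REWARDS)), key=reward_fn)
print(f"policy picks arm {chosen}; reward-maximizing arm is {best}")
```

In this sketch the learner tries arm 0 first, sees a positive reward, and never samples arm 1, so the trained policy keeps choosing the lower-paying arm. Real RL methods use exploration and far richer function approximators, so this is only meant to make the distinction between "trained with reward" and "maximizes reward when executed" concrete.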
In this post, I want to highlight a fact which I did not see mentioned in the original post or the comments: in the field of reinforcement learning, there are agents which are pursuing the goal of building a system which maximizes a reward function, subject to some additional constraints. These agents are the capabilities researchers designing and implementing SotA reinforcement learning algorithms and other methods to build and test the most capable, general systems across a variety of domains.
Capabilities researchers are not solely concerned with maximizing a particular reward function, since it is not very difficult or interesting to program a bot the old-fashioned way to beat a particular Atari game. For other games (e.g. Go) it is harder to beat human or machine performance by using traditional programming techniques, and trying doesn't make for a compelling AI research agenda.
Aside from the headline metric of how well a new RL method does in terms of training policies which result in a high reward when executed, RL researchers place importance on:
Efficiency, in many senses:
How much computing power does it take to train a policy to perform well in a particular domain?
How much (and what kind of) training data does it take to train a policy?
How much computing power and space (i.e. size of the model) does it take to execute the policy?
Generality of methods across domains: for example, can the same RL training process, applied to the same network architecture but with different training data, be used to train policies and create systems which play many different types of games well? Dreamer is an example of a very general and powerful RL method.
Generality of the trained system across domains: can a single system be trained to perform well in a variety of different domains, without retraining? Gato and LLMs are examples of this kind of generality.
Why is this observation important?
In my view, current RL methods have not yet advanced to the point of creating systems which can be indisputably described as agents which have any kind of values at all.
I view most attempts to draw parallels between high-level processes that happen in current-day AI systems and human brains as looking for patterns which do not yet exist.
Speculating on what kind of agents and values current DL-paradigm RL methods might produce in the future can be valuable and important research, but I think that it is important to remain grounded about what current systems are actually doing, and to be precise with terms.
As an example of where I think a lack of grounding about current systems and methods leads to things going wrong, in Evolut...