The Nonlinear Library

AF - Reward Hacking from a Causal Perspective by Tom Everitt

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Reward Hacking from a Causal Perspective, published by Tom Everitt on July 21, 2023 on The AI Alignment Forum.
Post 5 of Towards Causal Foundations of Safe AGI, preceded by Post 1: Introduction, Post 2: Causality, Post 3: Agency, and Post 4: Incentives.
By Francis Rhys Ward, Tom Everitt, Sebastian Benthall, James Fox, Matt MacDermott, Milad Kazemi, and Ryan Carey, representing the Causal Incentives Working Group. Thanks also to Toby Shevlane and Aliya Ahmad.
AI systems are typically trained to optimise an objective function, such as a loss or reward function. However, objective functions are sometimes misspecified in ways that allow them to be optimised without doing the intended task. This is called reward hacking. It can be contrasted with misgeneralisation, which occurs when the system extrapolates (potentially) correct feedback in unintended ways.
This post will discuss why human-provided rewards can sometimes fail to reflect what the human really wants, and why this can lead to malign incentives. It also considers several proposed solutions, all from the perspective of causal influence diagrams.
Why Humans Might Reward the Wrong Behaviours
In situations where a programmatic reward function is hard to specify, AI systems can often be trained from human feedback instead. For example, a content recommender may be optimising for likes, and a language model may be trained on feedback from human raters.
Unfortunately, humans don't always reward the behaviour they actually want. For example, a human may give positive feedback for a credible-sounding summary, even though it actually misses key points:
More concerningly, the system may covertly influence the human into providing positive feedback. For example, a recommender system with the goal of maximising engagement can do so by influencing the user's preferences and mood. This leads to a kind of reward misspecification, where the human provides positive feedback for situations that don't actually bring them utility.
A causal model of the situation reveals that the agent may have an instrumental control incentive (or, similarly, an intention) to manipulate the user's preferences. This can be inferred directly from the graph: first, the human may be influenced by the agent's behaviour, as they must observe it before evaluating it; and second, the agent can get better feedback by influencing the human:
For example, we typically read a post before deciding whether to "like" it. By making the user more emotional, the system may be more likely to increase engagement. While this effect is stronger for longer interactions, the incentive is there even for "single time-step" interactions.
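To make the graphical criterion concrete, here is a minimal sketch (not from the post itself) of how one might check it mechanically. It uses networkx, and the node names Behaviour, Preferences, and Feedback are illustrative stand-ins for the variables in the post's diagrams: a variable satisfies the criterion when it lies on a directed path from the agent's decision to its utility node.

```python
# Minimal sketch: checking the graphical criterion for an instrumental
# control incentive in a toy causal influence diagram.
# Node names are illustrative; the actual diagrams are in the post's figures.
import networkx as nx

# Toy single-time-step recommender diagram:
# the agent's Behaviour can shift the user's Preferences,
# which in turn determine the Feedback the agent is rewarded for.
cid = nx.DiGraph([
    ("Behaviour", "Preferences"),   # agent influences the user's state
    ("Preferences", "Feedback"),    # user's state shapes the feedback given
    ("Behaviour", "Feedback"),      # behaviour is also rated directly
])

def on_directed_path(graph, decision, utility, variable):
    """True if `variable` lies on a directed path from `decision` to `utility`,
    i.e. it satisfies the graphical criterion for an instrumental control incentive."""
    return (nx.has_path(graph, decision, variable)
            and nx.has_path(graph, variable, utility))

print(on_directed_path(cid, "Behaviour", "Feedback", "Preferences"))
# True: the agent can get better feedback by influencing the user's preferences.
```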
Scalable Oversight
One proposed solution to the reward misspecification problem is scalable oversight. It provides the human with a helper agent that advises them on what feedback to give. The helper agent observes the learning agent's behaviour, and may point out, for instance, an inaccuracy in a credible-looking summary, or warn against manipulation attempts. The extra assistance may make it harder for the learning agent to manipulate or deceive the human:
Influential scalable oversight agendas include iterated distillation and amplification, AI safety via debate, recursive reward modelling, and constitutional AI.
Unfortunately, the learning agent still has an incentive to deceive the human or manipulate their preferences, as the human's preferences still satisfy the graphical criterion for an instrumental control incentive: they lie on a directed causal path from the agent's behaviour to its feedback. The learning agent also has an incentive to deceive the helper agent:
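Continuing the earlier sketch (again with hypothetical node names, not the post's exact diagram), adding a helper-agent node does not remove the directed path from the learning agent's behaviour through the human's preferences to the feedback, and the helper's advice now lies on such a path as well:

```python
# Sketch continued: the same criterion applied to a scalable-oversight setup.
# Node names are illustrative assumptions, not the post's exact diagram.
import networkx as nx

oversight_cid = nx.DiGraph([
    ("Behaviour", "Preferences"),
    ("Preferences", "Feedback"),
    ("Behaviour", "Feedback"),
    ("Behaviour", "HelperAdvice"),   # the helper also observes the behaviour...
    ("HelperAdvice", "Feedback"),    # ...and advises the human's feedback
])

def on_directed_path(graph, decision, utility, variable):
    # Same graphical criterion as before.
    return (nx.has_path(graph, decision, variable)
            and nx.has_path(graph, variable, utility))

# The human's preferences still satisfy the criterion...
print(on_directed_path(oversight_cid, "Behaviour", "Feedback", "Preferences"))  # True
# ...and so does the helper's advice, so the learning agent also has an
# incentive to deceive or manipulate the helper.
print(on_directed_path(oversight_cid, "Behaviour", "Feedback", "HelperAdvice"))  # True
```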
An important question for scalable oversight schemes is whether weaker agents can effectively help to supervise more capable agents (and whether this can be done recursively to supervise agents much smart...