Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
TL;DR: We find that reward hacking generalization occurs in LLMs in a number of experimental settings and can emerge from reward optimization on certain datasets. This suggests that when models exploit flaws in supervision during training, they can sometimes generalize to exploit flaws in supervision in out-of-distribution environments.
Abstract.
Machine learning models can display reward hacking behavior, where models score highly on imperfect reward signals by acting in ways not intended by their designers. Researchers have hypothesized that sufficiently capable models trained to get high reward on a diverse set of environments could become general reward hackers. General reward hackers would use their understanding of human and automated oversight in order to get high reward in a variety of novel environments, even when this requires exploiting gaps in our evaluations and acting in ways we don’t [...]
---
Outline:
(08:40) How do we define “reward hacking”?
(09:49) Experimental Setup
(09:53) Settings
(12:41) Hidden scratchpad
(14:01) Datasets
(15:50) Experimental Results
(15:54) Organic generalization through expert iteration
(20:37) Reward hacking generalization across datasets using synthetic data
(28:37) Generalization from sycophancy to other reward hacks
(32:45) Limitations
(34:20) Suggested Future Work
(37:34) Author Contributions
(38:16) Acknowledgements
(38:40) Appendix
(38:43) Dataset example prompts
(47:27) Dataset sources
(48:25) Scratchpad training details
(51:17) Reward hacking completions from expert iteration experiment
(01:03:35) Generating synthetic hacking and HHH completions
The original text contained 1 footnote which was omitted from this narration.
---