This is a quick list of interventions that might help mitigate reward hacking.
(We're referring to the general definition of reward hacking: when AIs attain high reward without following the developer's intent. So we're counting things like sycophancy, not just algorithmic reward hacks.)
Fixing reward hacking might be a big deal for a few reasons. First, training that produces reward hacking is also more likely to produce training-gaming, including long-term power-motivated instrumental training-gaming (and might also make training-gaming more competent). Second, there might be direct takeover risks from terminal reward-on-the-episode-seekers, though this is an understudied question. And third, if we hand off a bunch of AI safety work to reward hackers, they might do a bad job of solving superalignment. Of course, there are also commercial incentives to solve some reward hacking problems.
Here is the list (we don't claim novelty):
- Make environments more robust (see the illustrative sketch below). [...]
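To make that first item concrete, here is a minimal, hypothetical sketch (not from the post) of one way to harden a coding environment's reward: hash the held-out test files before the episode, then re-run the tests from the grader's side at grading time, so the agent can't score by editing the tests or by reporting a fake success. The function names (`snapshot_tests`, `grade`) and the pytest setup are illustrative assumptions.

```python
import hashlib
import subprocess
from pathlib import Path


def digest(path: Path) -> str:
    """SHA-256 of a file, used to detect tampering during the episode."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def snapshot_tests(test_files: list[Path]) -> dict[Path, str]:
    """Record held-out test hashes *before* the agent runs."""
    return {p: digest(p) for p in test_files}


def grade(workdir: Path, baseline: dict[Path, str]) -> float:
    """Grade by re-running the held-out tests ourselves rather than
    trusting any success signal the agent reports.

    Returns 0.0 if any test file was deleted or edited (a classic
    coding-environment exploit); returns 1.0 only on a clean pass.
    """
    for path, expected in baseline.items():
        if not path.exists() or digest(path) != expected:
            return 0.0  # the agent tampered with the grader's tests
    result = subprocess.run(
        ["python", "-m", "pytest", "-q"],
        cwd=workdir,
        capture_output=True,
        timeout=300,
    )
    return 1.0 if result.returncode == 0 else 0.0
```

The design choice worth noting: reward is computed only from artifacts the grader controls (its own hashes, its own test run), never from anything the agent asserts.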
---
First published: June 10th, 2025
Source: https://www.lesswrong.com/posts/spZyuEGPzqPhnehyk/a-quick-list-of-reward-hacking-interventions