This is a quick list of interventions that might help mitigate reward hacking.
(We're referring to the general definition of reward hacking: when AIs attain high reward without following the developer's intent. So we're counting things like sycophancy, not just algorithmic reward hacks.)
Fixing reward hacking might be a big deal for a few reasons. First, training that produces reward hacking is also more likely to produce training-gaming, including long-term power-motivated instrumental training-gaming (and might also make training-gaming more competent). Second, there might be direct takeover risks from terminal reward-on-the-episode-seekers, though this is an understudied question. And third, if we hand off a bunch of AI safety work to reward hackers, they might do a bad job of solving superalignment. Of course, there are also commercial incentives to solve some reward hacking problems.
Here is the list (we don't claim novelty):
- Make environments more robust (see the illustrative sketch below). [...]
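To make that first item concrete, here is a minimal, hypothetical sketch (not from the post) of one way to harden a coding environment's reward: hash the held-out test files before the episode, then re-run the tests from the grader's side at grading time, so the agent can't score by editing the tests or by reporting a fake success. The function names (`snapshot_tests`, `grade`) and the pytest setup are illustrative assumptions.

```python
import hashlib
import subprocess
from pathlib import Path


def digest(path: Path) -> str:
    """SHA-256 of a file, used to detect tampering during the episode."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def snapshot_tests(test_files: list[Path]) -> dict[Path, str]:
    """Record held-out test hashes *before* the agent runs."""
    return {p: digest(p) for p in test_files}


def grade(workdir: Path, baseline: dict[Path, str]) -> float:
    """Grade by re-running the held-out tests ourselves rather than
    trusting any success signal the agent reports.

    Returns 0.0 if any test file was deleted or edited (a classic
    coding-environment exploit); returns 1.0 only on a clean pass.
    """
    for path, expected in baseline.items():
        if not path.exists() or digest(path) != expected:
            return 0.0  # the agent tampered with the grader's tests
    result = subprocess.run(
        ["python", "-m", "pytest", "-q"],
        cwd=workdir,
        capture_output=True,
        timeout=300,
    )
    return 1.0 if result.returncode == 0 else 0.0
```

The design choice worth noting: reward is computed only from artifacts the grader controls (its own hashes, its own test run), never from anything the agent asserts.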
---
First published: June 10th, 2025
Source: https://www.lesswrong.com/posts/spZyuEGPzqPhnehyk/a-quick-list-of-reward-hacking-interventions