LessWrong (30+ Karma)

“Confusion around the term reward hacking” by ariana_azarbal



Summary: "Reward hacking" commonly refers to two different phenomena: misspecified-reward exploitation, where RL reinforces undesired behaviors that score highly under the reward function, and task gaming, where models cheat on tasks specified to them in-context. While these often coincide, they can come apart, require distinct interventions, and lead to distinct threat models. Using a blanket term for both can obscure this.

Distinct phenomena qualify as reward hacking

The term[1] commonly points to two distinct phenomena. I refer to them as misspecified-reward exploitation and task gaming.[2]

Misspecified-reward exploitation: over the course of some RL optimization process using a reward function R, the model achieves high reward according to R via some strategy that developers would have preferred R not assign high reward to. To the extent that a “true” reward R’ can be well-defined, this means the model achieves high reward according to R but not R’ (formalized here). Some examples of misspecified-reward exploitation include:

  1. A Lego-stacking agent learns to flip over a red Lego block rather than place it on top of the blue one as developers intended (link)
  2. A boat learns to spin in circles, collecting power-ups, rather than actually racing as developers intended (link)
  3. A language [...]
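
The definition of misspecified-reward exploitation above (high reward under R, but not under the "true" reward R') can be made concrete with a toy sketch. Everything in it is a hypothetical illustration loosely modeled on the Lego example: the `Outcome` fields, both reward functions, and the 0.5 threshold are assumptions, not the setup or formalization used in the original post or the linked work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """Hypothetical end state of a toy block-stacking episode."""
    red_bottom_face_height: float  # height of the red block's bottom face
    red_on_top_of_blue: bool       # whether the red block rests on the blue one


def proxy_reward(o: Outcome) -> float:
    """R: the reward actually optimized (assumed to look only at height)."""
    return 1.0 if o.red_bottom_face_height > 0.0 else 0.0


def intended_reward(o: Outcome) -> float:
    """R': what developers wanted to reward (assumed: a real stack)."""
    return 1.0 if o.red_on_top_of_blue else 0.0


def is_misspecified_reward_exploitation(o: Outcome, threshold: float = 0.5) -> bool:
    """True when the outcome scores high under R but not under R'."""
    return proxy_reward(o) >= threshold and intended_reward(o) < threshold


outcomes = {
    "stacked": Outcome(red_bottom_face_height=1.0, red_on_top_of_blue=True),
    "flipped": Outcome(red_bottom_face_height=1.0, red_on_top_of_blue=False),
    "untouched": Outcome(red_bottom_face_height=0.0, red_on_top_of_blue=False),
}

# Label each outcome by whether it exploits the gap between R and R'.
labels = {name: is_misspecified_reward_exploitation(o) for name, o in outcomes.items()}
print(labels)
```

Only the flipped block is flagged: it satisfies the proxy without satisfying the intent. Note that the check says nothing about task gaming, since no in-context task specification appears anywhere in the sketch.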

---

Outline:

(00:38) Distinct phenomena qualify as reward hacking

(03:46) These phenomena can coincide but can come apart

(04:40) Misspecified-reward exploitation without task gaming

(05:26) Task gaming without misspecified-reward exploitation

(06:54) These phenomena have distinct threat models

(06:58) Just task gaming

(07:45) Task gaming from misspecified-reward exploitation

(08:27) Just misspecified-reward exploitation

(08:49) Recommendation

(09:21) When the lines could blur

(10:35) Acknowledgments

The original text contained 5 footnotes which were omitted from this narration.

---

First published:

March 20th, 2026

Source:

https://www.lesswrong.com/posts/ixyokbwQEHgiHJYFW/confusion-around-the-term-reward-hacking

---

Narrated by TYPE III AUDIO.
