March 21, 2026

“Confusion around the term reward hacking” by ariana_azarbal

11 minutes

Summary: "Reward hacking" commonly refers to two different phenomena: misspecified-reward exploitation, where RL reinforces undesired behaviors that score highly under the reward function, and task gaming, where models cheat on tasks specified to them in-context. While these often coincide, they can come apart, require distinct interventions, and lead to distinct threat models. Using a blanket term for both can obscure this.

Distinct phenomena qualify as reward hacking

The term[1] commonly points to two distinct phenomena. I refer to them as misspecified-reward exploitation and task gaming.[2]

Misspecified-reward exploitation: over the course of some RL optimization process using a reward function R, the model achieves high reward according to R via some strategy that developers would have preferred R not assign high reward to. To the extent that a “true” reward R’ can be well-defined, this means the model achieves high reward according to R but not R’ (formalized here). Some examples of misspecified-reward exploitation include:

A Lego-stacking agent learning to flip over a red lego instead of placing it on top of the blue one, as developers intended (link)
A boat learns to spin in circles, collecting power ups, rather than actually racing, as developers intended (link)
A language [...]

---

Outline:

(00:38) Distinct phenomena qualify as reward hacking

(03:46) These phenomena can coincide but can come apart

(04:40) Misspecified-reward exploitation without task gaming

(05:26) Task gaming without misspecified-reward exploitation

(06:54) These phenomena have distinct threat models

(06:58) Just task gaming

(07:45) Task gaming from misspecified-reward exploitation

(08:27) Just misspecified-reward exploitation

(08:49) Recommendation

(09:21) When the lines could blur

(10:35) Acknowledgments

The original text contained 5 footnotes which were omitted from this narration.

---

First published:

March 20th, 2026

Source:

https://www.lesswrong.com/posts/ixyokbwQEHgiHJYFW/confusion-around-the-term-reward-hacking

---

Narrated by TYPE III AUDIO.

...more

View all episodes

By LessWrong

March 21, 2026

“Confusion around the term reward hacking” by ariana_azarbal

11 minutes

Distinct phenomena qualify as reward hacking

The term[1] commonly points to two distinct phenomena. I refer to them as misspecified-reward exploitation and task gaming.[2]

A Lego-stacking agent learning to flip over a red lego instead of placing it on top of the blue one, as developers intended (link)
A boat learns to spin in circles, collecting power ups, rather than actually racing, as developers intended (link)
A language [...]

---

Outline:

(00:38) Distinct phenomena qualify as reward hacking

(03:46) These phenomena can coincide but can come apart

(04:40) Misspecified-reward exploitation without task gaming

(05:26) Task gaming without misspecified-reward exploitation

(06:54) These phenomena have distinct threat models

(06:58) Just task gaming

(07:45) Task gaming from misspecified-reward exploitation

(08:27) Just misspecified-reward exploitation

(08:49) Recommendation

(09:21) When the lines could blur

(10:35) Acknowledgments

The original text contained 5 footnotes which were omitted from this narration.

---

First published:

March 20th, 2026

Source:

https://www.lesswrong.com/posts/ixyokbwQEHgiHJYFW/confusion-around-the-term-reward-hacking

---

Narrated by TYPE III AUDIO.

...more

More shows like LessWrong (30+ Karma)

View all

The Daily

112,194 Listeners

Astral Codex Ten Podcast

131 Listeners

Interesting Times with Ross Douthat

7,229 Listeners

Dwarkesh Podcast

563 Listeners

The Ezra Klein Show

16,198 Listeners

AI Article Readings

4 Listeners

Doom Debates!

14 Listeners

LessWrong posts by zvi

2 Listeners

Share “Confusion around the term reward hacking” by ariana_azarbal

Sign up to save your podcasts

“Confusion around the term reward hacking” by ariana_azarbal

“Confusion around the term reward hacking” by ariana_azarbal

More shows like LessWrong (30+ Karma)

The Daily

Astral Codex Ten Podcast

Interesting Times with Ross Douthat

Dwarkesh Podcast

The Ezra Klein Show

AI Article Readings

Doom Debates!

LessWrong posts by zvi