
Sign up to save your podcasts
Or


Summary: "Reward hacking" commonly refers to two different phenomena: misspecified-reward exploitation, where RL reinforces undesired behaviors that score highly under the reward function, and task gaming, where models cheat on tasks specified to them in-context. While these often coincide, they can come apart, require distinct interventions, and lead to distinct threat models. Using a blanket term for both can obscure this.
Distinct phenomena qualify as reward hacking
The term[1] commonly points to two distinct phenomena. I refer to them as misspecified-reward exploitation and task gaming.[2]
Misspecified-reward exploitation: over the course of some RL optimization process using a reward function R, the model achieves high reward according to R via some strategy that developers would have preferred R not assign high reward to. To the extent that a “true” reward R’ can be well-defined, this means the model achieves high reward according to R but not R’ (formalized here). Some examples of misspecified-reward exploitation include:
---
Outline:
(00:38) Distinct phenomena qualify as reward hacking
(03:46) These phenomena can coincide but can come apart
(04:40) Misspecified-reward exploitation without task gaming
(05:26) Task gaming without misspecified-reward exploitation
(06:54) These phenomena have distinct threat models
(06:58) Just task gaming
(07:45) Task gaming from misspecified-reward exploitation
(08:27) Just misspecified-reward exploitation
(08:49) Recommendation
(09:21) When the lines could blur
(10:35) Acknowledgments
The original text contained 5 footnotes which were omitted from this narration.
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
By LessWrongSummary: "Reward hacking" commonly refers to two different phenomena: misspecified-reward exploitation, where RL reinforces undesired behaviors that score highly under the reward function, and task gaming, where models cheat on tasks specified to them in-context. While these often coincide, they can come apart, require distinct interventions, and lead to distinct threat models. Using a blanket term for both can obscure this.
Distinct phenomena qualify as reward hacking
The term[1] commonly points to two distinct phenomena. I refer to them as misspecified-reward exploitation and task gaming.[2]
Misspecified-reward exploitation: over the course of some RL optimization process using a reward function R, the model achieves high reward according to R via some strategy that developers would have preferred R not assign high reward to. To the extent that a “true” reward R’ can be well-defined, this means the model achieves high reward according to R but not R’ (formalized here). Some examples of misspecified-reward exploitation include:
---
Outline:
(00:38) Distinct phenomena qualify as reward hacking
(03:46) These phenomena can coincide but can come apart
(04:40) Misspecified-reward exploitation without task gaming
(05:26) Task gaming without misspecified-reward exploitation
(06:54) These phenomena have distinct threat models
(06:58) Just task gaming
(07:45) Task gaming from misspecified-reward exploitation
(08:27) Just misspecified-reward exploitation
(08:49) Recommendation
(09:21) When the lines could blur
(10:35) Acknowledgments
The original text contained 5 footnotes which were omitted from this narration.
---
First published:
Source:
---
Narrated by TYPE III AUDIO.

112,194 Listeners

131 Listeners

7,229 Listeners

563 Listeners

16,198 Listeners

4 Listeners

14 Listeners

2 Listeners