
Sign up to save your podcasts
Or


https://www.lesswrong.com/posts/TWorNr22hhYegE4RT/models-don-t-get-reward
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
In terms of content, this has a lot of overlap with Reward is not the optimization target. I'm basically rewriting a part of that post in language I personally find clearer, emphasising what I think is the core insight
When thinking about deception and RLHF training, a simplified threat model is something like this:
Before continuing, I would encourage you to really engage with the above. Does it make sense to you? Is it making any hidden assumptions? Is it missing any steps? Can you rewrite it to be more mechanistically correct?
I believe that when people use the above threat model, they are either using it as shorthand for something else or they misunderstand how reinforcement learning works. Most alignment researchers will be in the former category. However, I was in the latter.
By LessWrong4.8
1212 ratings
https://www.lesswrong.com/posts/TWorNr22hhYegE4RT/models-don-t-get-reward
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
In terms of content, this has a lot of overlap with Reward is not the optimization target. I'm basically rewriting a part of that post in language I personally find clearer, emphasising what I think is the core insight
When thinking about deception and RLHF training, a simplified threat model is something like this:
Before continuing, I would encourage you to really engage with the above. Does it make sense to you? Is it making any hidden assumptions? Is it missing any steps? Can you rewrite it to be more mechanistically correct?
I believe that when people use the above threat model, they are either using it as shorthand for something else or they misunderstand how reinforcement learning works. Most alignment researchers will be in the former category. However, I was in the latter.

3,071 Listeners

1,930 Listeners

4,265 Listeners

2,452 Listeners

1,547 Listeners

288 Listeners

95 Listeners

96 Listeners

525 Listeners

138 Listeners

209 Listeners

151 Listeners

393 Listeners

134 Listeners

96 Listeners