January 12, 2023

"Models Don't 'Get Reward'" by Sam Ringer

10 minutes

---
client: lesswrong
project_id: curated
feed_id: ai, ai_safety, ai_safety__technical
narrator: pw
qa: km
narrator_time: 1h05m
qa_time: 0h10m
---
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

In terms of content, this has a lot of overlap with Reward is not the optimization target. I'm basically rewriting a part of that post in language I personally find clearer, emphasising what I think is the core insight.

When thinking about deception and RLHF training, a simplified threat model is something like this:

A model takes some actions.
If a human approves of these actions, the human gives the model some reward.
Humans can be deceived into giving reward in situations where they would otherwise not if they had more knowledge.
Models will take advantage of this so they can get more reward.
Models will therefore become deceptive.

Before continuing, I would encourage you to really engage with the above. Does it make sense to you? Is it making any hidden assumptions? Is it missing any steps? Can you rewrite it to be more mechanistically correct?

I believe that when people use the above threat model, they are either using it as shorthand for something else or they misunderstand how reinforcement learning works. Most alignment researchers will be in the former category. However, I was in the latter.

Original article:
https://www.lesswrong.com/posts/TWorNr22hhYegE4RT/models-don-t-get-reward

Narrated for LessWrong by TYPE III AUDIO.

Share feedback on this narration.

...more

View all episodes

By TYPE III AUDIO

January 12, 2023

"Models Don't 'Get Reward'" by Sam Ringer

10 minutes

A model takes some actions.
If a human approves of these actions, the human gives the model some reward.
Humans can be deceived into giving reward in situations where they would otherwise not if they had more knowledge.
Models will take advantage of this so they can get more reward.
Models will therefore become deceptive.

Share feedback on this narration.

...more

Share "Models Don't 'Get Reward'" by Sam Ringer

Sign up to save your podcasts

"Models Don't 'Get Reward'" by Sam Ringer

"Models Don't 'Get Reward'" by Sam Ringer