
Produced as part of MATS 6.0 and 6.1.
Key takeaways:
You can find the full paper here.
The code for the project can be [...]
---
Outline:
(00:16) Key takeaways
(01:31) Why would anyone optimize for user feedback?
(02:25) Simulating user feedback
(04:52) Harmful behaviors reliably emerge when training with exploitable user feedback
(06:34) Even if most users give good feedback, LLMs will learn to target exploitable users
(08:50) Can this be solved by mixing in harmlessness data?
(09:50) Surely we would detect such harmful behaviors with evals?
(11:17) Can we use another LLM to veto harmful training data?
(13:26) Why does training with the veto not eliminate bad behavior?
(15:47) Chain of thought reveals RL-induced motivated reasoning
(18:12) Discussion
(18:15) Model personalization and backdoors as two sides of the same coin
(19:05) Are our simulated users realistic?
(20:54) What do our results mean for annotator and AI feedback gaming more broadly?
(21:56) Would we expect our method to be a good proxy for future user feedback optimization techniques?
(22:27) Acknowledgements
The original text contained 2 footnotes which were omitted from this narration.
The original text contained 8 images which were described by AI.
---
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.