
Reinforcement Learning from Human Feedback, or RLHF, is one of the main ways that makers of large language models make them 'aligned'. But people have long noted that there are difficulties with this approach when the models are smarter than the humans providing feedback. In this episode, I talk with Scott Emmons about his work categorizing the problems that can show up in this setting.
Patreon: patreon.com/axrpodcast
Ko-fi: ko-fi.com/axrpodcast
The transcript: https://axrp.net/episode/2024/06/12/episode-33-rlhf-problems-scott-emmons.html
Topics we discuss, and timestamps:
0:00:33 - Deceptive inflation
0:17:56 - Overjustification
0:32:48 - Bounded human rationality
0:50:46 - Avoiding these problems
1:14:13 - Dimensional analysis
1:23:32 - RLHF problems, in theory and practice
1:31:29 - Scott's research program
1:39:42 - Following Scott's research
Scott's website: https://www.scottemmons.com
Scott's X/Twitter account: https://x.com/emmons_scott
When Your AIs Deceive You: Challenges With Partial Observability of Human Evaluators in Reward Learning: https://arxiv.org/abs/2402.17747
Other works we discuss:
AI Deception: A Survey of Examples, Risks, and Potential Solutions: https://arxiv.org/abs/2308.14752
Uncertain decisions facilitate better preference learning: https://arxiv.org/abs/2106.10394
Invariance in Policy Optimisation and Partial Identifiability in Reward Learning: https://arxiv.org/abs/2203.07475
The Humble Gaussian Distribution (aka principal component analysis and dimensional analysis): http://www.inference.org.uk/mackay/humble.pdf
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!: https://arxiv.org/abs/2310.03693
Episode art by Hamish Doodles: hamishdoodles.com
By Daniel Filan
