2.1 Summary
In the last post, I introduced model-based RL, which is the frame we will use to analyze the alignment problem, and we learned that the critic is trained to predict reward.
I already briefly mentioned that the alignment problem is centrally about making the critic assign high value to outcomes we like and low value to outcomes we don’t like. In this post, we’re going to try to get some intuition for what values a critic may learn, and thereby also learn about some key difficulties of the alignment problem.
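Since "the critic is trained to predict reward" is the load-bearing idea here, the following is a minimal illustrative sketch, not taken from the post: a toy linear critic fit by gradient descent to predict the reward observed in each state. The feature dimension, the linear form, and the synthetic data are all assumptions; in a real model-based RL agent the critic is typically a neural network and predicts long-run value, not just immediate reward.

```python
# Minimal sketch (illustrative assumptions throughout): a linear "critic"
# trained by gradient descent to predict observed reward from state features.
import numpy as np

rng = np.random.default_rng(0)

# Toy environment: states are 4-dim feature vectors; the "true" reward
# is an unknown linear function of the features plus noise.
true_w = np.array([1.0, -2.0, 0.5, 0.0])
states = rng.normal(size=(500, 4))
rewards = states @ true_w + 0.1 * rng.normal(size=500)

# Critic: predicted_value(s) = w . s, fit by minimizing the squared error
# between the critic's prediction and the reward actually received.
w = np.zeros(4)
lr = 0.05
for _ in range(200):
    preds = states @ w
    grad = states.T @ (preds - rewards) / len(states)  # MSE gradient (up to a constant)
    w -= lr * grad

print("learned critic weights:", np.round(w, 2))  # converges toward true_w
```

The point of the toy: the critic ends up valuing whatever pattern actually predicted reward in training, which is exactly why the question "what values will the critic learn?" is nontrivial.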
---
Outline:
(00:16) 2.1. Summary
(03:48) 2.2. The Distributional Leap
(05:26) 2.3. A Naive Training Strategy
(07:01) 2.3.1. How this relates to current AIs
(08:26) 2.4. What might the critic learn?
(09:55) 2.4.1. Might the critic learn to score honesty highly?
(12:35) 2.4.1.1. Aside: Contrast to the human value of honesty
(13:05) 2.5. Niceness is not optimal
(14:59) 2.6. Niceness is not (uniquely) simple
(16:02) 2.6.1. Anthropomorphic Optimism
(19:26) 2.6.2. Intuitions from looking at humans may mislead you
(21:12) 2.7. Natural Abstractions or Alienness?
(21:35) 2.7.1. Natural Abstractions
(23:15) 2.7.2. ... or Alienness?
(25:54) 2.8. Value extrapolation
(27:49) 2.8.1. Coherent Extrapolated Volition
(32:48) 2.9. Conclusion
The original text contained 11 footnotes which were omitted from this narration.
---
Narrated by TYPE III AUDIO.