By Redwood Research
Subtitle: A reason alignment could be hard.
Written with Alek Westover and Anshul Khandelwal
This post explores what I’ll call the Alignment Drift Hypothesis:
An AI system that is initially aligned will generally drift into misalignment after a sufficient number of successive modifications, even if these modifications select for alignment with fixed and unreliable metrics.
The intuition behind the Alignment Drift Hypothesis. The ball bouncing around is an AI system being modified (e.g. during training, or when it's learning in-context). Random variations push the model's propensities around until it becomes misaligned in ways that slip through selection criteria for alignment.
I don’t think this hypothesis always holds; however, I think it points at an important cluster of threat models, and is a key reason aligning powerful AI systems could turn out to be difficult.
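To make that intuition concrete, here is a minimal sketch of the dynamic described above: repeated random modifications filtered through a fixed but unreliable selection metric. This is my own illustration, not the post's Evolution or Gradient Descent toy models (which are not excerpted here), and the names alignment_metric and truly_aligned are hypothetical stand-ins for "what the overseers can check" versus "what actually matters".

```python
# Toy sketch (assumption, not the post's toy models): an AI's "propensity" is a
# point in R^2, each modification is a small random perturbation, and a fixed
# but unreliable metric only observes the first coordinate. Drift along the
# unobserved coordinate accumulates until the system is misaligned in a way the
# metric never sees.
import random

random.seed(0)

def alignment_metric(propensity):
    """Unreliable selection criterion: only checks the observed coordinate."""
    observed, _unobserved = propensity
    return abs(observed) < 0.5  # passes as long as the visible part looks fine

def truly_aligned(propensity):
    """Ground truth: both coordinates must stay near the aligned origin."""
    return all(abs(x) < 0.5 for x in propensity)

propensity = [0.0, 0.0]  # start aligned
for step in range(10_000):
    # Propose a random modification (training update, in-context learning, ...).
    candidate = [x + random.gauss(0, 0.05) for x in propensity]
    # Selection: keep the modification only if the metric says it's fine.
    if alignment_metric(candidate):
        propensity = candidate
    if not truly_aligned(propensity):
        print(f"Drifted into misalignment at step {step}: {propensity}")
        break
else:
    print("Stayed aligned on this run")
```

In this sketch the visible coordinate stays bounded (selection works where the metric looks), while the unobserved coordinate performs an unconstrained random walk and typically crosses the misalignment threshold within a few hundred steps.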
1. Whac-a-mole misalignment
Here's the idea illustrated in a scenario.
Several years from now, an AI system called Agent-0 is equipped with long-term memory. Agent-0 needs this memory to learn about new research topics on the fly over the course of months. But long-term memory also causes Agent-0's propensities to change.
Sometimes Agent-0 thinks about moral philosophy, and deliberately decides to revise its values. Other times, Agent-0 [...]
---
Outline:
(01:08) 1. Whac-a-mole misalignment
(04:00) 2. Misalignment is like entropy, maybe
(07:37) 3. Or maybe this whole drift story is wrong
(10:36) 4. What does the empirical evidence say?
(12:30) 5. Even if the alignment drift hypothesis is correct, is it a problem in practice?
(17:07) 6. Understanding alignment drift dynamics with toy models
(17:35) 6.1. Evolution Toy Model
(22:51) 6.2. Gradient Descent Toy Model
(29:42) 7. Conclusion
---
Narrated by TYPE III AUDIO.