Summary:
We think a lot about aligning AGI with human values. I think it's more likely that we'll try to make the first AGIs do something else. This might intuitively be described as trying to make instruction-following (IF), or do-what-I-mean-and-check (DWIMAC), the central goal of the AGI we design. Adopting this goal target seems to improve the odds of success for any technical alignment approach. It avoids the hard problem of specifying human values in an adequately precise and stable way, and it substantially helps with goal misspecification and deception by allowing one to treat the AGI as a collaborator in keeping it aligned as it becomes smarter and takes on more complex tasks.
This is similar to, but distinct from, the goal targets of prosaic alignment efforts. Instruction-following is a single goal target that is [...]
---
Outline:
(01:35) Overview/Intuition
(01:39) How to use instruction-following AGI as a collaborator in alignment
(03:02) Instruction-following is safer than value alignment in a slow takeoff
(05:00) Relation to existing alignment approaches
(08:45) DWIMAC as goal target - more precise definition
(11:25) Intuition: a good employee follows instructions as they were intended
(14:30) Alignment difficulties reduced:
(14:34) Learning from examples is not precise enough to reliably convey alignment goals
(15:11) Solving ethics well enough to launch sovereign AGI is hard
(15:37) Alignment difficulties remaining or made worse:
(15:42) Deceptive alignment is possible, and interpretability work does not seem on track to fully address this
(17:24) Power remains in the hands of humans
(19:33) Well, that just sounds like slavery with extra steps
(20:19) Maximizing goal following may be risky
(21:25) Conclusion
The original text contained 8 footnotes, which were omitted from this narration.
---