LessWrong (30+ Karma)

“Instruction-following AGI is easier and more likely than value aligned AGI” by Seth Herd



Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Summary:

We think a lot about aligning AGI with human values. I think it's more likely that we’ll try to make the first AGIs do something else. This might intuitively be described as trying to make instruction-following (IF) or do-what-I-mean-and-check (DWIMAC) be the central goal of the AGI we design. Adopting this goal target seems to improve the odds of success of any technical alignment approach. This goal target avoids the hard problem of specifying human values in an adequately precise and stable way, and substantially helps with goal misspecification and deception by allowing one to treat the AGI as a collaborator in keeping it aligned as it becomes smarter and takes on more complex tasks.

This is similar but distinct from the goal targets of prosaic alignment efforts. Instruction-following is a single goal target that is [...]

---

Outline:

(01:35) Overview/Intuition

(01:39) How to use instruction-following AGI as a collaborator in alignment

(03:02) Instruction-following is safer than value alignment in a slow takeoff

(05:00) Relation to existing alignment approaches

(08:45) DWIMAC as goal target - more precise definition

(11:25) Intuition: a good employee follows instructions as they were intended

(14:30) Alignment difficulties reduced:

(14:34) Learning from examples is not precise enough to reliably convey alignment goals

(15:11) Solving ethics well enough to launch sovereign AGI is hard.

(15:37) Alignment difficulties remaining or made worse:

(15:42) Deceptive alignment is possible, and interpretability work does not seem on track to fully address this.

(17:24) Power remains in the hands of humans

(19:33) Well that just sounds like slavery with extra steps

(20:19) Maximizing goal-following may be risky

(21:25) Conclusion

The original text contained 8 footnotes which were omitted from this narration.

---

First published:

May 15th, 2024

Source:

https://www.lesswrong.com/posts/7NvKrqoQgJkZJmcuD/instruction-following-agi-is-easier-and-more-likely-than

---

Narrated by TYPE III AUDIO.

