LessWrong (30+ Karma)

“Instruction-following AGI is easier and more likely than value aligned AGI” by Seth Herd



Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Summary:

We think a lot about aligning AGI with human values. I think it's more likely that we’ll try to make the first AGIs do something else. This might intuitively be described as trying to make instruction-following (IF) or do-what-I-mean-and-check (DWIMAC) be the central goal of the AGI we design. Adopting this goal target seems to improve the odds of success of any technical alignment approach. This goal target avoids the hard problem of specifying human values in an adequately precise and stable way, and substantially helps with goal misspecification and deception by allowing one to treat the AGI as a collaborator in keeping it aligned as it becomes smarter and takes on more complex tasks.

This is similar but distinct from the goal targets of prosaic alignment efforts. Instruction-following is a single goal target that is [...]

---

Outline:

(01:35) Overview/Intuition

(01:39) How to use instruction-following AGI as a collaborator in alignment

(03:02) Instruction-following is safer than value alignment in a slow takeoff

(05:00) Relation to existing alignment approaches

(08:45) DWIMAC as goal target - more precise definition

(11:25) Intuition: a good employee follows instructions as they were intended

(14:30) Alignment difficulties reduced:

(14:34) Learning from examples is not precise enough to reliably convey alignment goals

(15:11) Solving ethics well enough to launch sovereign AGI is hard.

(15:37) Alignment difficulties remaining or made worse:

(15:42) Deceptive alignment is possible, and interpretability work does not seem on track to fully address this.

(17:24) Power remains in the hands of humans

(19:33) Well that just sounds like slavery with extra steps

(20:19) Maximizing goal-following may be risky

(21:25) Conclusion

The original text contained 8 footnotes which were omitted from this narration.

---

First published:

May 15th, 2024

Source:

https://www.lesswrong.com/posts/7NvKrqoQgJkZJmcuD/instruction-following-agi-is-easier-and-more-likely-than

---

Narrated by TYPE III AUDIO.

