
Sign up to save your podcasts
Or


One of the central arguments for AI existential risk goes through inner misalignment: a model trained to exhibit aligned behavior might be pursuing a different objective internally, which diverges from the intended behavior when conditions shift. This is a core claim of If Anyone Builds It, Everyone Dies: we can't reliably aim an ASI at any goal, let alone the precise target of human values.
A common source of optimism, articulated by Nora Belrose & Quintin Pope or Jan Leike, analyzed by John Wentworth, and summarized in this recent IABIED review, goes something like: current LLMs already display good moral reasoning; human values are pervasive in training data and constitute "natural abstractions" that sufficiently capable learners converge on; so alignment should be quite easy, and get easier with scale.
I think the history of LLM jailbreaking is a neat empirical test of this claim.
Jailbreaking
A jailbreak works like this:
This is goal misgeneralization. The model learned something during safety training that [...]
---
Outline:
(01:08) Jailbreaking
(02:59) What the alignment-by-default view would predict (I think)
(03:35) It doesnt work
(05:00) Conclusion
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
By LessWrongOne of the central arguments for AI existential risk goes through inner misalignment: a model trained to exhibit aligned behavior might be pursuing a different objective internally, which diverges from the intended behavior when conditions shift. This is a core claim of If Anyone Builds It, Everyone Dies: we can't reliably aim an ASI at any goal, let alone the precise target of human values.
A common source of optimism, articulated by Nora Belrose & Quintin Pope or Jan Leike, analyzed by John Wentworth, and summarized in this recent IABIED review, goes something like: current LLMs already display good moral reasoning; human values are pervasive in training data and constitute "natural abstractions" that sufficiently capable learners converge on; so alignment should be quite easy, and get easier with scale.
I think the history of LLM jailbreaking is a neat empirical test of this claim.
Jailbreaking
A jailbreak works like this:
This is goal misgeneralization. The model learned something during safety training that [...]
---
Outline:
(01:08) Jailbreaking
(02:59) What the alignment-by-default view would predict (I think)
(03:35) It doesnt work
(05:00) Conclusion
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
Source:
---
Narrated by TYPE III AUDIO.

112,217 Listeners

131 Listeners

7,243 Listeners

558 Listeners

16,290 Listeners

4 Listeners

14 Listeners

2 Listeners