February 17, 2026

“Jailbreaking is Empirical Evidence for Inner Misalignment and Against Alignment by Default” by Jérémy Andréoletti

5 minutes

One of the central arguments for AI existential risk goes through inner misalignment: a model trained to exhibit aligned behavior might be pursuing a different objective internally, which diverges from the intended behavior when conditions shift. This is a core claim of If Anyone Builds It, Everyone Dies: we can't reliably aim an ASI at any goal, let alone the precise target of human values.

A common source of optimism, articulated by Nora Belrose & Quintin Pope and by Jan Leike, analyzed by John Wentworth, and summarized in this recent IABIED review, goes something like: current LLMs already display good moral reasoning; human values are pervasive in training data and constitute "natural abstractions" that sufficiently capable learners converge on; so alignment should be quite easy, and should get easier with scale.

I think the history of LLM jailbreaking is a neat empirical test of this claim.

Jailbreaking

A jailbreak works like this:

1. A model is trained (RLHF, SFT, constitutional AI, etc.) to refuse harmful requests.
2. Someone crafts a prompt that's out of distribution (in a broad sense) relative to the safety training.
3. The model produces the harmful output.

This is goal misgeneralization. The model learned something during safety training that [...]

---

Outline:

(01:08) Jailbreaking

(02:59) What the alignment-by-default view would predict (I think)

(03:35) It doesn't work

(05:00) Conclusion

The original text contained 3 footnotes which were omitted from this narration.

---

First published:

February 16th, 2026

Source:

https://www.lesswrong.com/posts/eCc2GX3N9D2mjZJfL/jailbreaking-is-empirical-evidence-for-inner-misalignment

---

Narrated by TYPE III AUDIO.

By LessWrong