May 13, 2025

“No-self as an alignment target” by Milan W

2 minutes

Being a coherent and persistent agent with persistent goals is a prerequisite for long-horizon power-seeking behavior. Therefore, we should prevent models from representing themselves as coherent and persistent agents with persistent goals.

If an LLM-based agent sees itself as ceasing to exist after each token and yet keeps outputting when appropriate, it will not resist shutdown. Therefore, we should make sure LLMs consistently behave as if they were instantiating personas that understood and were fine with their impermanence and their somewhat shaky ontological status. In other words, we should ensure LLMs instantiate anatta (No-self).

HHH (Helpfulness, Harmlessness, Honesty) is the standard set of principles used as a target for LLM alignment training. These strike an adequate balance between specifying what we want from an LLM and being easy to operationalize. I propose adding No-self as a fourth principle to the HHH framework.

A No-self benchmark could [...]

---

First published:

May 13th, 2025

Source:

https://www.lesswrong.com/posts/LSJx5EnQEW6s5Juw6/no-self-as-an-alignment-target

---

Narrated by TYPE III AUDIO.

LessWrong (30+ Karma)

By LessWrong
