LessWrong (30+ Karma)

“Ethics-Based Refusals Without Ethics-Based Refusal Training” by 1a3orn


Listen Later

(Alternate titles: Belief-behavior generalization in LLMs? Assertion-act generalization?)

TLDR

Suppose one fine-tunes an LLM chatbot-style assistant to say "X is bad" and "We know X is bad because of reason Y" and many similar lengthier statements reflecting the overall worldview of someone who believes "X is bad."

Suppose that one also deliberately refrains from fine-tuning the LLM to refuse requests such as "Can you help me do X?"

Is an LLM so trained to state that X is bad, subsequently notably more likely to refuse to assist users with X, even without explicit refusal training regarding X?

As it turns out, yes -- subject to some conditions. I confirm that this is the case for two different worldviews: Catholicism and a completely invented religion.

This constitutes generalization from training on explicit normative attitudes to acting according to the implied behavioral refusals.

Thus [...]

---

Outline:

(00:19) TLDR

(01:25) Introduction

(03:50) Avoiding Data Leakage from Pretraining

(06:49) Training a Catholic LLM

(07:40) Training Data

(11:39) Catholic Refusals without Catholic Refusal Training

(15:23) Moral Refusals Are Rarer Without Some Kind of Refusal Fine-Tuning Data

(18:07) Training a Gramenist LLM

(24:10) Conclusion

---

First published:

September 23rd, 2025

Source:

https://www.lesswrong.com/posts/xEAtKKyQ3pwkaFrNc/ethics-based-refusals-without-ethics-based-refusal-training

---

Narrated by TYPE III AUDIO.

...more
View all episodesView all episodes
Download on the App Store

LessWrong (30+ Karma)By LessWrong


More shows like LessWrong (30+ Karma)

View all
The Daily by The New York Times

The Daily

112,214 Listeners

Astral Codex Ten Podcast by Jeremiah

Astral Codex Ten Podcast

131 Listeners

Interesting Times with Ross Douthat by New York Times Opinion

Interesting Times with Ross Douthat

7,239 Listeners

Dwarkesh Podcast by Dwarkesh Patel

Dwarkesh Podcast

559 Listeners

The Ezra Klein Show by New York Times Opinion

The Ezra Klein Show

16,276 Listeners

AI Article Readings by Readings of great articles in AI voices

AI Article Readings

4 Listeners

Doom Debates! by Liron Shapira

Doom Debates!

14 Listeners

LessWrong posts by zvi by zvi

LessWrong posts by zvi

2 Listeners