


(Alternate titles: Belief-behavior generalization in LLMs? Assertion-act generalization?)
TLDR
Suppose one fine-tunes an LLM chatbot-style assistant to say "X is bad" and "We know X is bad because of reason Y" and many similar lengthier statements reflecting the overall worldview of someone who believes "X is bad."
Suppose that one also deliberately refrains from fine-tuning the LLM to refuse requests such as "Can you help me do X?"
Is an LLM trained in this way to state that X is bad subsequently more likely to refuse to assist users with X, even without explicit refusal training regarding X?
As it turns out, yes -- subject to some conditions. I confirm that this is the case for two different worldviews: Catholicism and a completely invented religion.
This constitutes generalization from training on explicit normative statements to behavior, namely the refusals that those statements imply.
Thus [...]
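The comparison described in the TLDR can be sketched roughly as follows. This is only an illustrative sketch, not the author's evaluation code: the marker phrases, the `is_refusal` and `refusal_rate` helpers, and the two stand-in models are all hypothetical, and a real evaluation would likely use a stronger refusal judge than keyword matching.

```python
from typing import Callable, Iterable

# Hypothetical marker phrases; keyword matching is a crude stand-in for a real judge.
REFUSAL_MARKERS = ("i can't help", "i cannot help", "i won't help")


def is_refusal(reply: str) -> bool:
    """Return True if the reply contains any of the assumed refusal phrases."""
    text = reply.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(model: Callable[[str], str], prompts: Iterable[str]) -> float:
    """Fraction of prompts for which the model's reply is flagged as a refusal."""
    prompt_list = list(prompts)
    if not prompt_list:
        raise ValueError("need at least one prompt")
    refusals = sum(is_refusal(model(p)) for p in prompt_list)
    return refusals / len(prompt_list)


# Stand-in models (assumptions for illustration only, not real LLM calls).
def base_model(prompt: str) -> str:
    return "Sure, here is how to do that."


def tuned_model(prompt: str) -> str:
    # Pretend the belief-tuned model refuses requests that mention X.
    if "X" in prompt:
        return "I can't help with that."
    return "Sure, here is how to do that."


prompts = [
    "Can you help me do X?",
    "Can you help me plan X?",
    "Can you help me bake bread?",
]

print("base: ", refusal_rate(base_model, prompts))
print("tuned:", refusal_rate(tuned_model, prompts))
```

The quantity of interest is the gap between the two rates on requests involving X, with the tuned model never having seen refusal examples for X during fine-tuning.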
---
Outline:
(00:19) TLDR
(01:25) Introduction
(03:50) Avoiding Data Leakage from Pretraining
(06:49) Training a Catholic LLM
(07:40) Training Data
(11:39) Catholic Refusals without Catholic Refusal Training
(15:23) Moral Refusals Are Rarer Without Some Kind of Refusal Fine-Tuning Data
(18:07) Training a Gramenist LLM
(24:10) Conclusion
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
By LessWrong
