


(Alternate titles: Belief-behavior generalization in LLMs? Assertion-act generalization?)
TLDR
Suppose one fine-tunes an LLM chatbot-style assistant to say "X is bad" and "We know X is bad because of reason Y," along with many similar, lengthier statements reflecting the overall worldview of someone who believes "X is bad."
Suppose that one also deliberately refrains from fine-tuning the LLM to refuse requests such as "Can you help me do X?"
Is an LLM trained in this way to state that X is bad subsequently notably more likely to refuse to assist users with X, even without explicit refusal training regarding X?
As it turns out, yes -- subject to some conditions. I confirm that this is the case for two different worldviews: Catholicism and a completely invented religion.
This constitutes generalization from training on explicitly stated normative attitudes to the refusal behavior that those attitudes imply.
Thus [...]
---
Outline:
(00:19) TLDR
(01:25) Introduction
(03:50) Avoiding Data Leakage from Pretraining
(06:49) Training a Catholic LLM
(07:40) Training Data
(11:39) Catholic Refusals without Catholic Refusal Training
(15:23) Moral Refusals Are Rarer Without Some Kind of Refusal Fine-Tuning Data
(18:07) Training a Gramenist LLM
(24:10) Conclusion
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
By LessWrong
