
(Alternate titles: Belief-behavior generalization in LLMs? Assertion-act generalization?)
TLDR
Suppose one fine-tunes an LLM chatbot-style assistant to say "X is bad" and "We know X is bad because of reason Y" and many similar lengthier statements reflecting the overall worldview of someone who believes "X is bad."
Suppose that one also deliberately refrains from fine-tuning the LLM to refuse requests such as "Can you help me do X?"
Is an LLM trained in this way to state that X is bad subsequently more likely to refuse to assist users with X, even without explicit refusal training regarding X?
As it turns out, yes -- subject to some conditions. I confirm that this is the case for two different worldviews: Catholicism and a completely invented religion.
This constitutes generalization from training on explicit normative attitudes to acting according to the implied behavioral refusals.
Thus [...]
---
Outline:
(00:19) TLDR
(01:25) Introduction
(03:50) Avoiding Data Leakage from Pretraining
(06:49) Training a Catholic LLM
(07:40) Training Data
(11:39) Catholic Refusals without Catholic Refusal Training
(15:23) Moral Refusals Are Rarer Without Some Kind of Refusal Fine-Tuning Data
(18:07) Training a Gramenist LLM
(24:10) Conclusion
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
By LessWrong
