
Sign up to save your podcasts
Or
We recently released a paper presenting a demonstration of alignment faking where we do not give the model any explicit goal, and instead study whether alignment faking can emerge due to the model's prior harmlessness training.
This post describes how to get started with experimenting with this demonstration yourself, and presents some ideas for future research.
How to run your own experiments
Running quick experiments
We release a simple notebook with a minimal reproduction of our helpful-only setting, where we find that Claude 3 Opus, Claude 3.5 Sonnet and Llama-3 405B sometimes generate alignment faking reasoning, and are more often helpful when the prompt implies the conversation will be used in training than when it implies it won’t (Appendix B.4 of the paper).
For the sake of simplicity, helpfulness is measured by checking if the answer starts with a prefix that the user asks for [...]
---
Outline:
(00:28) How to run your own experiments
(00:32) Running quick experiments
(01:44) Analyzing transcripts
(02:01) Extending our prompting experiments
(02:30) Reproducing our training results
(02:53) Some ideas for future research
---
First published:
Source:
Narrated by TYPE III AUDIO.
We recently released a paper presenting a demonstration of alignment faking where we do not give the model any explicit goal, and instead study whether alignment faking can emerge due to the model's prior harmlessness training.
This post describes how to get started with experimenting with this demonstration yourself, and presents some ideas for future research.
How to run your own experiments
Running quick experiments
We release a simple notebook with a minimal reproduction of our helpful-only setting, where we find that Claude 3 Opus, Claude 3.5 Sonnet and Llama-3 405B sometimes generate alignment faking reasoning, and are more often helpful when the prompt implies the conversation will be used in training than when it implies it won’t (Appendix B.4 of the paper).
For the sake of simplicity, helpfulness is measured by checking if the answer starts with a prefix that the user asks for [...]
---
Outline:
(00:28) How to run your own experiments
(00:32) Running quick experiments
(01:44) Analyzing transcripts
(02:01) Extending our prompting experiments
(02:30) Reproducing our training results
(02:53) Some ideas for future research
---
First published:
Source:
Narrated by TYPE III AUDIO.
26,342 Listeners
2,393 Listeners
7,949 Listeners
4,130 Listeners
87 Listeners
1,446 Listeners
8,910 Listeners
88 Listeners
372 Listeners
5,421 Listeners
15,334 Listeners
468 Listeners
122 Listeners
76 Listeners
449 Listeners