December 19, 2024

“How to replicate and extend our alignment faking demo” by Fabien Roger

Listen Later

4 minutes

This is a link post.

We recently released a paper presenting a demonstration of alignment faking where we do not give the model any explicit goal, and instead study whether alignment faking can emerge due to the model's prior harmlessness training.

This post describes how to get started with experimenting with this demonstration yourself, and presents some ideas for future research.

How to run your own experiments

Running quick experiments

We release a simple notebook with a minimal reproduction of our helpful-only setting, where we find that Claude 3 Opus, Claude 3.5 Sonnet and Llama-3 405B sometimes generate alignment faking reasoning, and are more often helpful when the prompt implies the conversation will be used in training than when it implies it won’t (Appendix B.4 of the paper).

For the sake of simplicity, helpfulness is measured by checking if the answer starts with a prefix that the user asks for [...]

---

Outline:

(00:28) How to run your own experiments

(00:32) Running quick experiments

(01:44) Analyzing transcripts

(02:01) Extending our prompting experiments

(02:30) Reproducing our training results

(02:53) Some ideas for future research

---

First published:

December 19th, 2024

Source:

https://www.lesswrong.com/posts/y5EniHFSpNxhLbmq6/how-to-replicate-and-extend-our-alignment-faking-demo

---

Narrated by TYPE III AUDIO.

...more

View all episodes

View all episodes

Download on the App Store

Download on the App Store

Get it on Google Play

LessWrong (30+ Karma)

By LessWrong

December 19, 2024

“How to replicate and extend our alignment faking demo” by Fabien Roger

Listen Later

4 minutes

This is a link post.

We recently released a paper presenting a demonstration of alignment faking where we do not give the model any explicit goal, and instead study whether alignment faking can emerge due to the model's prior harmlessness training.

This post describes how to get started with experimenting with this demonstration yourself, and presents some ideas for future research.

How to run your own experiments

Running quick experiments

We release a simple notebook with a minimal reproduction of our helpful-only setting, where we find that Claude 3 Opus, Claude 3.5 Sonnet and Llama-3 405B sometimes generate alignment faking reasoning, and are more often helpful when the prompt implies the conversation will be used in training than when it implies it won’t (Appendix B.4 of the paper).

For the sake of simplicity, helpfulness is measured by checking if the answer starts with a prefix that the user asks for [...]

---

Outline:

(00:28) How to run your own experiments

(00:32) Running quick experiments

(01:44) Analyzing transcripts

(02:01) Extending our prompting experiments

(02:30) Reproducing our training results

(02:53) Some ideas for future research

---

First published:

December 19th, 2024

Source:

https://www.lesswrong.com/posts/y5EniHFSpNxhLbmq6/how-to-replicate-and-extend-our-alignment-faking-demo

---

Narrated by TYPE III AUDIO.

...more

More shows like LessWrong (30+ Karma)

The Daily by The New York Times

The Daily

112,193 Listeners

Astral Codex Ten Podcast by Jeremiah

Astral Codex Ten Podcast

131 Listeners

Interesting Times with Ross Douthat by New York Times Opinion

Interesting Times with Ross Douthat

7,227 Listeners

Dwarkesh Podcast by Dwarkesh Patel

Dwarkesh Podcast

564 Listeners

The Ezra Klein Show by New York Times Opinion

The Ezra Klein Show

16,216 Listeners

AI Article Readings by Readings of great articles in AI voices

AI Article Readings

4 Listeners

Doom Debates! by Liron Shapira

Doom Debates!

14 Listeners

LessWrong posts by zvi by zvi

LessWrong posts by zvi

2 Listeners