
The evening after Claude's new constitution was published, about 15 AI safety FTEs and Astra fellows discussed the constitution, its weaknesses, and its implications. After the discussion, I compiled some of their most compelling recommendations:
Increase transparency about the character training process.
Much of the document is purposefully hedged and vague in its exact prescriptions; therefore, the training process used to instill the constitution is extremely load-bearing. We wish more of this information were in the accompanying blog post and supplementary material. We think sharing it is unlikely to leak any trade secrets: even a blogpost-level overview, like the one published with the constitution in 2023, would provide valuable information to external researchers.
High-level overview of Constitutional AI from https://www.anthropic.com/news/claudes-constitution
We’re also interested in seeing more empirical data on behavioral changes as a result of the new constitution. For instance, would fine-tuning on the corrigibility section reduce alignment faking by Claude 3 Opus? We’d be interested in more evidence showing whether, and how, the constitution improved apparent alignment.
Increase data on edge-case behavior.
Expected behavior in several edge cases (e.g., action boundaries when the principal hierarchy is illegitimate) is extremely unclear. While Claude is expected to at most conscientiously object [...]
---
Narrated by TYPE III AUDIO.
By LessWrong
