February 03, 2026

“Three ways to make Claude’s constitution better” by Parv Mahajan

3 minutes

The evening after Claude's new constitution was published, about 15 AI safety FTEs and Astra fellows discussed the constitution, its weaknesses, and its implications. After the discussion, I compiled some of their most compelling recommendations:

Increase transparency about the character training process.
Much of the document is purposefully hedged and vague in its exact prescriptions; therefore, the training process used to instill the constitution is extremely load-bearing. We wish more of this information were in the accompanying blog post and supplementary material. We think sharing it is unlikely to leak any trade secrets: even a blog-post-level overview, the kind given with the constitution in 2023, would provide valuable information to external researchers.

High-level overview of Constitutional AI from https://www.anthropic.com/news/claudes-constitution

We’re also interested in seeing more empirical data on behavioral changes as a result of the new constitution. For instance, would fine-tuning on the corrigibility section reduce alignment faking by Claude 3 Opus? We’d be interested in more evidence showing whether, and how, the constitution improved apparent alignment.

Increase data on edge-case behavior.
Expected behavior in several edge cases (e.g., action boundaries when the principal hierarchy is illegitimate) is extremely unclear. While Claude is expected to at most conscientiously object [...]

---

First published:

February 2nd, 2026

Source:

https://www.lesswrong.com/posts/SC4Zsr6hxspKEMqmR/three-ways-to-make-claude-s-constitution-better

---

Narrated by TYPE III AUDIO.
