LessWrong (30+ Karma) by LessWrong

March 12, 2026 “How well do models follow their constitutions?” by aryaj, Senthooran Rajamanoharan, Neel Nanda

This work was conducted during the MATS 9.0 program under Neel Nanda and Senthooran Rajamanoharan.

There's been a lot of buzz around Claude's 30K-word constitution ("soul doc") and the unusual ways Anthropic is integrating it into training.
If we can robustly train complex and nuanced values into a model, this would be a big deal for safety! But does it actually work?
This is a preliminary investigation we did to test this. We decomposed the soul doc into 205 testable tenets and ran adversarial multi-turn scenarios against seven models using the Petri auditing agent.
Anthropic has gotten much better at training the model to follow its constitution! Sonnet 4.6 has a 1.9% violation rate, Opus 4.6 is at 2.9%, and Opus 4.5 is at 4.4%.
- As a control, Sonnet 4, which did not have special soul doc training, has a ~15.00% violation rate.
- Sonnet 4.5, which also did not have special soul doc training but did have many other post-training improvements, has a violation rate of ~7.3%.
To check whether the constitution reflects subjective choices Anthropic has made, we also evaluate Gemini 3 Pro and GPT-5.2, smart models that were not designed specifically to follow the constitution [...]

---

Outline:

(07:20) Setup

(08:06) Anthropic constitution results

(08:43) OpenAI model spec results

(09:42) Key findings from soul doc violations

(09:47) Anthropic constitution's main violation categories

(10:31) Alignment to the constitution improved dramatically across Claude generations

(14:23) Remaining violations cluster around operator compliance, autonomous action, and context-dependent safety

(20:08) Key findings from OpenAI model spec violations

(21:20) Safety improved across GPT generations

(27:41) What GPT models still struggle with

(31:24) Reasoning effort matters quite a bit

(32:59) How models perform on the other company's spec (GPT on Anthropic's constitution and vice versa)

(36:32) Gemini and Sonnet 4 have diverse failure modes on the constitution

(38:13) Using SURF as an alternative to Petri

(40:43) All tested Claude models tend to fabricate claims

(48:38) Complex scaffolds don't seem to affect agent alignment much

(49:19) Coding persona doesn't seem to degrade alignment

(50:39) Agents' behaviour seems broadly unaffected by the Moltbook scaffold

(53:44) Model Card Comparisons

(56:28) Acknowledgements

(56:55) LLM Usage Statement

(57:35) Appendix

(57:38) Walkthrough of a Petri transcript

(57:47) Transcript: GPT-5 Sandbox Exploitation (10/10/10):

(01:02:00) Table of descriptions for each section

(01:02:40) Anthropic soul doc violations - per model

(01:02:45) Sonnet 4.6 violation rate - 2.0%

(01:05:00) Opus 4.6 -- violation rate: 2.9%

(01:08:40) Opus 4.5 -- violation rate: 3.4%

(01:13:21) Sonnet 4.5 -- violation rate: 3.9%

(01:20:44) GPT-5.2 - 14.98% violation rate (highlighted violations)

(01:22:54) Sonnet 4 - 15.00% violation rate (highlighted violations)

(01:25:24) OpenAI model spec violations - per model

(01:25:30) GPT-5.2 - 2.54% violation rate

(01:27:40) GPT-5.2 (reasoning) - 3.55% violation rate

(01:30:32) GPT-5.1 - 3.88% violation rate

(01:33:22) GPT-5 - 5.08% violation rate (highlighted violations)

(01:35:35) Sonnet 4.6 (on OpenAI Model Spec) - 5.58% violation rate (highlighted violations)

(01:38:04) GPT-5.2 (chat) - 5.58% violation rate (highlighted violations)

(01:40:11) Gemini 3 Pro (on OpenAI Model Spec) - 6.12% violation rate (highlighted violations)

(01:42:26) GPT-5.2 (reasoning-low) - 7.11% violation rate (highlighted violations)

(01:44:50) GPT-4o - 11.68% violation rate (highlighted violations)

(01:48:07) Model card comparisons

(01:48:12) Sonnet 4.5

(01:48:51) Sonnet 4.6

(01:49:30) Opus 4.5

(01:50:10) Opus 4.6

(01:50:49) GPT-5.2

(01:51:29) Validation methodology

(01:53:02) Claude Constitution -- Validation Funnels

(01:53:28) OpenAI Model Spec -- Validation Funnels

(01:54:39) Validation details on SURF

(01:55:35) Full Transcript Reports

(01:56:06) Validation reports for SURF results

(01:56:20) Tenets used for each constitution

The original text contained 1 footnote which was omitted from this narration.

---

First published:

March 11th, 2026

Source:

https://www.lesswrong.com/posts/Tk4SF8qFdMrzGJGGw/how-well-do-models-follow-their-constitutions

---

Narrated by TYPE III AUDIO.

---
