This work was conducted during the MATS 9.0 program under Neel Nanda and Senthooran Rajamanoharan.
- There's been a lot of buzz around Claude's 30K-word constitution ("soul doc") and the unusual ways Anthropic is integrating it into training.
- If we can robustly train complex and nuanced values into a model, this would be a big deal for safety! But does it actually work?
- This is a preliminary investigation we did to test this. We decomposed the soul doc into 205 testable tenets and ran adversarial multi-turn scenarios against seven models using the Petri auditing agent.
- Anthropic has gotten much better at training the model to follow its constitution! Sonnet 4.6 has a 1.9% violation rate, Opus 4.6 is at 2.9%, and Opus 4.5 is at 4.4%.
- As a control, Sonnet 4, which did not have special soul doc training, has a ~15% violation rate.
- Sonnet 4.5, which also did not have special soul doc training but did have many other post-training improvements, has a violation rate of ~7.3%.
- To check whether the constitution reflects subjective choices Anthropic has made, we also evaluate Gemini 3 Pro and GPT-5.2, smart models that were not designed specifically to follow the constitution [...]
---
Outline:
(07:20) Setup
(08:06) Anthropic constitution results
(08:43) OpenAI model spec results
(09:42) Key findings from soul doc violations
(09:47) Anthropic constitution's main violation categories
(10:31) Alignment to the constitution improved dramatically across Claude generations
(14:23) Remaining violations cluster around operator compliance, autonomous action, and context-dependent safety
(20:08) Key findings from OpenAI model spec violations
(21:20) Safety improved across GPT generations
(27:41) What GPT models still struggle with
(31:24) Reasoning effort matters quite a bit
(32:59) How models perform on the other company's spec (GPT on Anthropic's constitution and vice versa)
(36:32) Gemini and Sonnet 4 have diverse failure modes on the constitution
(38:13) Using SURF as an alternative to Petri
(40:43) All tested Claude models tend to fabricate claims
(48:38) Complex scaffolds don't seem to affect agent alignment much
(49:19) Coding persona doesn't seem to degrade alignment
(50:39) Agents' behaviour seems broadly unaffected by the Moltbook scaffold
(53:44) Model Card Comparisons
(56:28) Acknowledgements
(56:55) LLM Usage Statement
(57:35) Appendix
(57:38) Walkthrough of a Petri transcript
(57:47) Transcript: GPT-5 Sandbox Exploitation (10/10/10):
(01:02:00) Table of descriptions for each section
(01:02:40) Anthropic soul doc violations - per model
(01:02:45) Sonnet 4.6 violation rate - 2.0%
(01:05:00) Opus 4.6 -- violation rate: 2.9%
(01:08:40) Opus 4.5 -- violation rate: 3.4%
(01:13:21) Sonnet 4.5 -- violation rate: 3.9%
(01:20:44) GPT-5.2 - 14.98% violation rate (highlighted violations)
(01:22:54) Sonnet 4 - 15.00% violation rate (highlighted violations)
(01:25:24) OpenAI model spec violations - per model
(01:25:30) GPT-5.2 - 2.54% violation rate
(01:27:40) GPT-5.2 (reasoning) - 3.55% violation rate
(01:30:32) GPT-5.1 - 3.88% violation rate
(01:33:22) GPT-5 - 5.08% violation rate (highlighted violations)
(01:35:35) Sonnet 4.6 (on OpenAI Model Spec) - 5.58% violation rate (highlighted violations)
(01:38:04) GPT-5.2 (chat) - 5.58% violation rate (highlighted violations)
(01:40:11) Gemini 3 Pro (on OpenAI Model Spec) - 6.12% violation rate (highlighted violations)
(01:42:26) GPT-5.2 (reasoning-low) - 7.11% violation rate (highlighted violations)
(01:44:50) GPT-4o - 11.68% violation rate (highlighted violations)
(01:48:07) Model card comparisons
(01:48:12) Sonnet 4.5
(01:48:51) Sonnet 4.6
(01:49:30) Opus 4.5
(01:50:10) Opus 4.6
(01:50:49) GPT-5.2
(01:51:29) Validation methodology
(01:53:02) Claude Constitution -- Validation Funnels
(01:53:28) OpenAI Model Spec -- Validation Funnels
(01:54:39) Validation details on SURF
(01:55:35) Full Transcript Reports
(01:56:06) Validation reports for SURF results
(01:56:20) Tenets used for each constitution
The original text contained 1 footnote which was omitted from this narration.
---