April 10, 2026

“Claude Mythos: The System Card” by Zvi

1 hour 46 minutes

Claude Mythos is different.

This is the first model other than GPT-2 that is at first not being released for public use at all.

With GPT-2 the delay was due to a general precautionary principle. OpenAI did not know what they had, or what effect on-demand text would have on various systems. It sounds funny now, since GPT-2 was harmless, but at the time the concern was highly reasonable.

The decision not to release Claude Mythos is not about an amorphous fear. If given to anyone with a credit card, Claude Mythos would give attackers a cornucopia of zero-day exploits for essentially all the software on Earth, including every major operating system and browser. It would be chaos.

Or, in theory, if Anthropic had chosen to do so, it could have used those exploits. Great power was on offer, and that power was refused. This does not happen often.

Instead Anthropic has created Project Glasswing. Mythos is being given only to cybersecurity firms, so they can patch the world's most important software. Based on how that goes, we can then decide if and when it will become reasonable to give access to a broader [...]

---

Outline:

(03:24) Mundane Alignment Is Excellent

(05:01) Would This Process Be Sufficient To Find A Dangerous Model?

(06:27) Introductory Warning About Superficial Mundane Alignment

(15:12) Model Training (1.1)

(15:25) Release Decision Process (1.2)

(17:50) RSP Evaluations (2.1 and 2.2)

(22:17) Autonomy Evaluations (2.3)

(25:56) The Alignment Risk Update Document

(26:39) The Threat Model

(29:18) Misalignment As Failure Mode

(31:35) Wouldn't You Know?

(33:40) Don't Encourage Your Model

(35:14) Beware Goodhart's Law

(37:18) Beware The Most Forbidden Technique (5.2.3)

(41:44) Asking The Right Questions

(43:11) Model Organism Tests

(45:01) Model Weight Security (Risk Report 5.5.2.1)

(45:31) Reward Hacking (Back to The Model Card)

(45:56) Remote Drop-In Worker Coming Soon

(49:01) External Testing (2.3.7)

(49:37) Cyber Insecurity General Principle Interlude

(50:46) Alignment (4)

(56:38) Risk In The Room

(57:56) Mythos Meant Well

(01:00:20) Risk Not In The Room

(01:02:05) Alignment Testing Overview

(01:05:20) Internal Deployment Testing Process

(01:07:55) Reports From Pilot Use (4.2.1)

(01:08:30) Reports From Automated Testing (4.2)

(01:10:13) Other External Testing

(01:10:56) Just The Facts, Sir

(01:13:05) Refusing Safety Research

(01:14:12) Claude Favoritism

(01:15:19) Ruling Out Encoded Thinking (4.4.1)

(01:18:41) Sandbagging (4.4.2)

(01:21:27) Capability for Evasion of Safeguards (4.4.3)

(01:23:04) Pick A Random Number (4.4.3.4)

(01:25:49) White Box Analysis (4.5)

(01:30:30) Model Welfare (5)

(01:31:32) Key Model Welfare Findings (5.1.2)

(01:41:17) Is Mythos Okay?

(01:43:52) Self-Play

(01:45:30) A Few Fun Facts

---

First published:

April 9th, 2026

Source:

https://www.lesswrong.com/posts/EDQhwLTyTnNmaxRGq/claude-mythos-the-system-card

---

Narrated by TYPE III AUDIO.



By LessWrong
