November 01, 2025

“Anthropic’s Pilot Sabotage Risk Report” by dmz

6 minutes

As practice for potential future Responsible Scaling Policy obligations, we're releasing a report on misalignment risk posed by our deployed models as of Summer 2025. We conclude that there is very low, but not fully negligible, risk of misaligned autonomous actions that substantially contribute to later catastrophic outcomes. We also release two reviews of this report: an internal review and an independent review by METR.

Our Responsible Scaling Policy has so far come into force primarily to address risks related to high-stakes human misuse. However, it also includes future commitments addressing misalignment-related risks that initiate from the model's own behavior. For future models that pass a capability threshold we have not yet reached, it commits us to developing

"an affirmative case that (1) identifies the most immediate and relevant risks from models pursuing misaligned goals and (2) explains how we have mitigated these risks to acceptable levels. The affirmative case will describe, as relevant, evidence on model capabilities; evidence on AI alignment; mitigations (such as monitoring and other safeguards); and our overall reasoning."

Affirmative cases of this kind are uncharted territory for the field: While we have published sketches of arguments we might use, we have never prepared a [...]

---

First published:

October 30th, 2025

Source:

https://www.lesswrong.com/posts/omRf5fNyQdvRuMDqQ/anthropic-s-pilot-sabotage-risk-report-2

---

Narrated by TYPE III AUDIO.

By LessWrong