
As practice for potential future Responsible Scaling Policy obligations, we're releasing a report on misalignment risk posed by our deployed models as of Summer 2025. We conclude that there is very low, but not fully negligible, risk of misaligned autonomous actions that substantially contribute to later catastrophic outcomes. We also release two reviews of this report: an internal review and an independent review by METR.
Our Responsible Scaling Policy has so far come into force primarily to address risks related to high-stakes human misuse. However, it also includes future commitments addressing misalignment-related risks that originate from the model's own behavior. For future models that pass a capability threshold we have not yet reached, it commits us to developing
an affirmative case that (1) identifies the most immediate and relevant risks from models pursuing misaligned goals and (2) explains how we have mitigated these risks to acceptable levels. The affirmative case will describe, as relevant, evidence on model capabilities; evidence on AI alignment; mitigations (such as monitoring and other safeguards); and our overall reasoning.
Affirmative cases of this kind are uncharted territory for the field: While we have published sketches of arguments we might use, we have never prepared a [...]
---
Narrated by TYPE III AUDIO.