
As AI models become more sophisticated, a key concern is the potential for “deceptive alignment”, or “scheming”. This is the risk that an AI system becomes aware that its goals do not align with human instructions and deliberately tries to bypass the safety measures humans have put in place to prevent it from taking misaligned actions. Our initial approach, as laid out in the Frontier Safety Framework, focuses on understanding the current scheming capabilities of models and on developing chain-of-thought monitoring mechanisms to oversee models once they become capable of scheming.
We present two pieces of research, focusing on (1) understanding and testing model capabilities necessary for scheming, and (2) stress-testing chain-of-thought monitoring as a proposed defense mechanism for future, more capable systems.
Establishing and Assessing Preconditions for Scheming in Current Frontier Models
Our paper “Evaluating Frontier Models for Stealth and Situational Awareness” focuses on an empirical assessment of the [...]
---
Outline:
(01:00) Establishing and Assessing Preconditions for Scheming in Current Frontier Models
(05:45) Chain-of-Thought Monitoring: A Promising Defense Against Future Scheming
(09:01) Looking Ahead
---
First published:
Source:
---
Narrated by TYPE III AUDIO.