
As AI models become increasingly capable and autonomous, keeping them safely aligned with human intentions is critical. Extending our previous work on evaluating scheming capabilities, we introduce complementary approaches to test whether AI models would sabotage their own safeguards if given the opportunity.
Our new papers focus on propensity for scheming: when models are deployed as coding agents, will they take opportunities to undermine the safeguards designed to oversee them? We tackle this using two complementary testing methods: automated auditing using simulated agentic environments, and scheming honeypot evaluations based on real internal alignment research codebases.
Automated auditing for sabotage propensities
Figure 1. Overview of the Gram auditing pipeline and an example of Gemini's overeagerness.
Our first paper introduces Gram (Gauging Realistic Agentic Misbehavior), an automated auditing framework that uses simulated agentic environments to evaluate sabotage risk.
Gram builds on Petri, an existing automated auditing framework that generates simulated multi-turn agentic environments from seed scenario descriptions. We identified three limitations of Petri for testing sabotage propensities specifically and addressed them in Gram:
---
Outline:
(00:54) Automated auditing for sabotage propensities
(03:05) Key findings
(04:50) The investigator agent
(06:11) Scheming honeypot evaluations
(07:28) Evaluation design
(08:45) Key findings
(11:24) Conclusion
---
Narrated by TYPE III AUDIO.
By LessWrong
