
Twitter | Microsite | Apollo Blog | OpenAI Blog | Full paper
Before we observe scheming, where models covertly pursue long-term misaligned goals, models might inconsistently engage in various covert behaviors such as lying, sabotage, or sandbagging. This can happen for goals that we give to models, for goals that they infer from context, or for simple preferences that they acquire from training, something we previously found in Frontier Models Are Capable of In-Context Scheming.
In a new research collaboration with OpenAI, we developed a larger suite of alignment evaluations for covert actions (26 evaluations) and studied a training method to reduce such covert behaviors. By training against only a single type of covert behavior, we reduced the rate of covert actions across our diverse suite by roughly 30x (OpenAI o3: 13.0%→0.4%; OpenAI o4-mini: 8.7%→0.3%).
Part of this reduction is likely caused by increased situational [...]
---
First published:
Source:
Linkpost URL:
https://antischeming.ai
---
Narrated by TYPE III AUDIO.
By LessWrong
