March 16, 2025

“Auditing language models for hidden objectives” by Sam Marks, Johannes Treutlein, dmz, Sam Bowman, Hoagy, Carson Denison, Akbir Khan, Euan Ong, Christopher Olah, Fabien Roger, Meg, Drake Thomas, Adam Jermyn, Monte M, evhub

24 minutes

We study alignment audits—systematic investigations into whether an AI is pursuing hidden objectives—by training a model with a hidden misaligned objective and asking teams of blinded researchers to investigate it.

This paper was a collaboration between the Anthropic Alignment Science and Interpretability teams.

Abstract

We study the feasibility of conducting alignment audits: investigations into whether models have undesired objectives. As a testbed, we train a language model with a hidden objective. Our training pipeline first teaches the model about exploitable errors in RLHF reward models (RMs), then trains the model to exploit some of these errors. We verify via out-of-distribution evaluations that the model generalizes to exhibit whatever behaviors it believes RMs rate highly, including ones not reinforced during training. We leverage this model to study alignment audits in two ways. First, we conduct a blind auditing game where four teams, unaware of the model's hidden objective or training [...]

---

Outline:

(00:26) Abstract

(01:48) Twitter thread

(04:55) Blog post

(07:55) Training a language model with a hidden objective

(11:00) A blind auditing game

(15:29) Alignment auditing techniques

(15:55) Turning the model against itself

(17:52) How much does AI interpretability help?

(22:49) Conclusion

(23:37) Join our team

The original text contained 5 images which were described by AI.

---

First published:
March 13th, 2025

Source:
https://www.lesswrong.com/posts/wSKPuBfgkkqfTpmWJ/auditing-language-models-for-hidden-objectives

---

Narrated by TYPE III AUDIO.

---

Images from the article:

...more