July 24, 2025

“Building and evaluating alignment auditing agents” by Sam Marks, Sam Bowman, Euan Ong, Johannes Treutlein, evhub

11 minutes

TL;DR: We develop three agents that autonomously perform alignment auditing tasks. When tested against models with intentionally inserted alignment issues, our agents successfully uncover an LLM's hidden goal, build behavioral evaluations, and surface concerning LLM behaviors. We are using these agents to assist with alignment audits of frontier models like Claude 4.

Read the full Anthropic Alignment Science blog post. The X thread and blog post introduction are reproduced below.

Thread

(Original here)

New Anthropic research: Building and evaluating alignment auditing agents.

We developed three AI agents to autonomously complete alignment auditing tasks.

In testing, our agents successfully uncovered hidden goals, built safety evaluations, and surfaced concerning behaviors.

As AI systems become more powerful, we need scalable ways to assess their alignment.

Human alignment audits take time and are hard to validate.

Our solution: automating alignment auditing with AI agents.

Read more: https://alignment.anthropic.com/2025/automated-auditing/

Our [...]

---

Outline:

(00:45) Thread

(03:53) Introduction

---

First published: July 24th, 2025

Source: https://www.lesswrong.com/posts/DJAZHYjWxMrcd2na3/building-and-evaluating-alignment-auditing-agents

---

Narrated by TYPE III AUDIO.



By LessWrong

