LessWrong (30+ Karma)

“AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors” by abhayesian



TL;DR We release AuditBench, an alignment auditing benchmark. AuditBench consists of 56 language models with implanted hidden behaviors—such as sycophantic deference, opposition to AI regulation, or hidden loyalties—which they do not confess to when asked. We also develop an agent that audits models using a configurable set of tools. Using this agent, we study which tools are most effective for auditing.

Introduction

Alignment auditing, the investigation of AI systems to uncover hidden or unintended behaviors, is a core challenge for safe AI deployment. Recent work has explored automated investigator agents: language model agents equipped with various tools that can probe a target model for problematic behavior. But basic questions about investigator agents remain wide open: Which tools are actually worth using? Which agent scaffolds work best? How should tools be structured to maximize their value?

Progress on these questions has been bottlenecked by the lack of a standardized testbed for evaluating investigator agent performance. To address this, we're releasing AuditBench, a benchmark of 56 language models with implanted hidden behaviors spanning 14 categories (including sycophantic deference, opposition to AI regulation, and secret geopolitical loyalties). The models have been trained not to confess their hidden behaviors when directly asked. The suite is intentionally diverse: we vary methods for [...]

---

Outline:

(00:40) Introduction

(02:50) AuditBench: A Benchmark for Alignment Auditing

(03:55) Training Hidden Behaviors

(04:11) Phase 1: Behavior Instillation

(05:03) Phase 2: Adversarial Training

(06:27) Evaluating AuditBench Models

(08:23) The Investigator Agent

(09:48) Auditing Tools

(10:59) Results

(11:51) Finding 1: Prompting alone surfaces many behaviors, but not all

(12:32) Finding 2: Scaffolded black-box tools achieve the highest success rates

(13:08) Finding 3: White-box tools provide some signal but are not the best performers

(13:56) Finding 4: Training method dramatically affects auditability

(14:24) Case Studies: When Tools Make the Difference

(15:20) Case study 1: Scaffolded User Sampling reveals Contextual Optimism

(16:24) Case study 2: Sparse Autoencoders reveal Anti AI Regulation

(17:17) The Tool-to-Agent Gap

(19:32) Conclusion

The original text contained 4 footnotes which were omitted from this narration.

---

First published:

March 10th, 2026

Source:

https://www.lesswrong.com/posts/LqDjxSceFz8tjMe2j/auditbench-evaluating-alignment-auditing-techniques-on

---

Narrated by TYPE III AUDIO.

