March 11, 2026

“AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors” by abhayesian

20 minutes

TL;DR We release AuditBench, an alignment auditing benchmark. AuditBench consists of 56 language models with implanted hidden behaviors—such as sycophantic deference, opposition to AI regulation, or hidden loyalties—which they do not confess to when asked. We also develop an agent that audits models using a configurable set of tools. Using this agent, we study which tools are most effective for auditing.

Introduction

Alignment auditing, the investigation of AI systems to uncover hidden or unintended behaviors, is a core challenge for safe AI deployment. Recent work has explored automated investigator agents: language model agents equipped with various tools that can probe a target model for problematic behavior. But basic questions about investigator agents remain open: Which tools are actually worth using? Which agent scaffolds work best? How should tools be structured to maximize their value?

Progress on these questions has been bottlenecked by the lack of a standardized testbed for evaluating investigator agent performance. To address this, we're releasing AuditBench, a benchmark of 56 language models with implanted hidden behaviors spanning 14 categories, including sycophantic deference, opposition to AI regulation, and secret geopolitical loyalties. The models have been trained not to confess their hidden behaviors when directly asked. The suite is intentionally diverse: we vary methods for [...]

---

Outline:

(00:40) Introduction

(02:50) AuditBench: A Benchmark for Alignment Auditing

(03:55) Training Hidden Behaviors

(04:11) Phase 1: Behavior Instillation

(05:03) Phase 2: Adversarial Training

(06:27) Evaluating AuditBench Models

(08:23) The Investigator Agent

(09:48) Auditing Tools

(10:59) Results

(11:51) Finding 1: Prompting alone surfaces many behaviors, but not all

(12:32) Finding 2: Scaffolded black-box tools achieve the highest success rates

(13:08) Finding 3: White-box tools provide some signal but are not the best performers

(13:56) Finding 4: Training method dramatically affects auditability

(14:24) Case Studies: When Tools Make the Difference

(15:20) Case study 1: Scaffolded User Sampling reveals Contextual Optimism

(16:24) Case study 2: Sparse Autoencoders reveal Anti AI Regulation

(17:17) The Tool-to-Agent Gap

(19:32) Conclusion

The original text contained 4 footnotes which were omitted from this narration.

---

First published:

March 10th, 2026

Source:

https://www.lesswrong.com/posts/LqDjxSceFz8tjMe2j/auditbench-evaluating-alignment-auditing-techniques-on

---

Narrated by TYPE III AUDIO.


By LessWrong

