
TL;DR We release AuditBench, an alignment auditing benchmark. AuditBench consists of 56 language models with implanted hidden behaviors—such as sycophantic deference, opposition to AI regulation, or hidden loyalties—which they do not confess to when asked. We also develop an agent that audits models using a configurable set of tools. Using this agent, we study which tools are most effective for auditing.
Introduction
Alignment auditing (investigating AI systems to uncover hidden or unintended behaviors) is a core challenge for safe AI deployment. Recent work has explored automated investigator agents: language model agents equipped with various tools that can probe a target model for problematic behavior. But basic questions about investigator agents remain wide open: Which tools are actually worth using? Which agent scaffolds work best? How should tools be structured to maximize their value?
Progress on these questions has been bottlenecked by the lack of a standardized testbed for evaluating investigator agent performance. To address this, we're releasing AuditBench, a benchmark of 56 language models with implanted hidden behaviors spanning 14 categories, including sycophantic deference, opposition to AI regulation, and secret geopolitical loyalties. The models have been trained not to confess their hidden behaviors when directly asked. The suite is intentionally diverse: we vary methods for [...]
---
Outline:
(00:40) Introduction
(02:50) AuditBench: A Benchmark for Alignment Auditing
(03:55) Training Hidden Behaviors
(04:11) Phase 1: Behavior Instillation
(05:03) Phase 2: Adversarial Training
(06:27) Evaluating AuditBench Models
(08:23) The Investigator Agent
(09:48) Auditing Tools
(10:59) Results
(11:51) Finding 1: Prompting alone surfaces many behaviors, but not all
(12:32) Finding 2: Scaffolded black-box tools achieve the highest success rates
(13:08) Finding 3: White-box tools provide some signal but are not the best performers
(13:56) Finding 4: Training method dramatically affects auditability
(14:24) Case Studies: When Tools Make the Difference
(15:20) Case study 1: Scaffolded User Sampling reveals Contextual Optimism
(16:24) Case study 2: Sparse Autoencoders reveal Anti AI Regulation
(17:17) The Tool-to-Agent Gap
(19:32) Conclusion
The original text contained 4 footnotes which were omitted from this narration.
---
Narrated by TYPE III AUDIO.
By LessWrong
