
TL;DR: Mechanistic anomaly detection aims to flag when an AI produces outputs for “unusual reasons.” It is similar to mechanistic interpretability but doesn’t demand human understanding. I give a self-contained introduction to mechanistic anomaly detection from a slightly different angle than the existing one by Paul Christiano (focused less on heuristic arguments and drawing a more explicit parallel to interpretability).
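To make "flagging outputs produced for unusual reasons" concrete, here is a deliberately simple illustration (this is not ARC's proposal and not the method the post describes): fit a Gaussian to a model's hidden activations on a set of trusted inputs, then flag any new input whose activations are improbably far from that distribution. The activation arrays, the single hidden layer, and the percentile threshold below are all illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder "activations": in practice these would be hidden-layer
# activations of the model on a trusted reference set of inputs.
trusted_acts = rng.normal(size=(500, 16))

# Fit a simple Gaussian model of "normal" internal behaviour.
mean = trusted_acts.mean(axis=0)
cov = np.cov(trusted_acts, rowvar=False)
cov_inv = np.linalg.inv(cov + 1e-6 * np.eye(cov.shape[0]))

def anomaly_score(activation: np.ndarray) -> float:
    """Squared Mahalanobis distance from the trusted distribution."""
    diff = activation - mean
    return float(diff @ cov_inv @ diff)

# Threshold chosen from the trusted set itself (e.g. its 99th percentile).
threshold = np.percentile([anomaly_score(a) for a in trusted_acts], 99)

# An input whose activations look unlike anything in the trusted set gets
# flagged, even though no human ever "understood" the mechanism involved.
new_act = rng.normal(loc=3.0, size=16)
print("anomalous" if anomaly_score(new_act) > threshold else "normal")
```

Mechanistic anomaly detection as discussed in the post works with richer notions of a model's "reasons" than raw activation statistics, but the basic shape, a detector calibrated on a trusted reference set, is the same.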
Mechanistic anomaly detection was first introduced by the Alignment Research Center (ARC), and a lot of this post is based on their ideas. However, I am not affiliated with ARC; this post represents my perspective.
Introduction. We want to create useful AI systems that never do anything too bad. Mechanistic anomaly detection relaxes this goal in two big ways:
---
Outline:
(03:31) Mechanistic anomaly detection as an alternative to interpretability: a toy example
(10:31) General setup
(12:51) Methods for mechanistic anomaly detection
(15:05) Stories on applying MAD to AI safety
(16:09) Auditing an AI company
(18:26) Coordinated AI coup
---
First published:
Source:
Narrated by TYPE III AUDIO.
By LessWrong
