
TL;DR: Mechanistic anomaly detection aims to flag when an AI produces outputs for “unusual reasons.” It is similar to mechanistic interpretability but doesn’t demand human understanding. I give a self-contained introduction to mechanistic anomaly detection from a slightly different angle than the existing one by Paul Christiano (focused less on heuristic arguments and drawing a more explicit parallel to interpretability).
Mechanistic anomaly detection was first introduced by the Alignment Research Center (ARC), and a lot of this post is based on their ideas. However, I am not affiliated with ARC; this post represents my perspective.
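To make the idea concrete, here is a minimal, illustrative sketch (not from the post) of one of the simplest possible mechanistic anomaly detectors: fit a Gaussian to a model's hidden activations on trusted inputs, then flag new inputs whose activations are unlikely under that fit. The `get_activations` helper, the layer choice, and the threshold are all hypothetical stand-ins for illustration.

```python
import numpy as np

# Illustrative sketch of a simple mechanistic anomaly detector:
# model activations on trusted inputs define "normal reasons";
# new inputs with unusual activations get a high anomaly score.

def fit_detector(trusted_activations: np.ndarray):
    """Estimate mean and regularized precision of trusted activations (shape: n x d)."""
    mean = trusted_activations.mean(axis=0)
    cov = np.cov(trusted_activations, rowvar=False)
    cov += 1e-3 * np.eye(cov.shape[0])  # regularize so the covariance is invertible
    precision = np.linalg.inv(cov)
    return mean, precision

def anomaly_score(activation: np.ndarray, mean: np.ndarray, precision: np.ndarray) -> float:
    """Mahalanobis distance: larger means the activations look more 'unusual'."""
    delta = activation - mean
    return float(delta @ precision @ delta)

# Hypothetical usage:
# trusted = get_activations(model, trusted_inputs)      # shape (n_trusted, d); assumed helper
# mean, precision = fit_detector(trusted)
# score = anomaly_score(get_activations(model, [x])[0], mean, precision)
# flag = score > threshold                              # threshold tuned on held-out trusted data
```

This baseline only illustrates the shape of the problem; it does not capture the "reasons" behind an output in the richer sense the post discusses.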
Introduction. We want to create useful AI systems that never do anything too bad. Mechanistic anomaly detection relaxes this goal in two big ways:
---
Outline:
(03:31) Mechanistic anomaly detection as an alternative to interpretability: a toy example
(10:31) General setup
(12:51) Methods for mechanistic anomaly detection
(15:05) Stories on applying MAD to AI safety
(16:09) Auditing an AI company
(18:26) Coordinated AI coup
---
Narrated by TYPE III AUDIO.