LessWrong (30+ Karma)

“Interpretability Will Not Reliably Find Deceptive AI” by Neel Nanda


Listen Later

(Disclaimer: Post written in a personal capacity. These are personal hot takes and do not in any way represent my employer's views.)

TL;DR: I do not think we will produce high reliability methods to evaluate or monitor the safety of superintelligent systems via current research paradigms, with interpretability or otherwise. Interpretability seems a valuable tool here and remains worth investing in, as it will hopefully increase the reliability we can achieve. However, interpretability should be viewed as part of an overall portfolio of defences: a layer in a defence-in-depth strategy. It is not the one thing that will save us, and it still won’t be enough for high reliability.

Introduction

There's a common, often implicit, argument made in AI safety discussions: interpretability is presented as the only reliable path forward for detecting deception in advanced AI - among many other sources it was argued for in [...]

---

Outline:

(00:55) Introduction

(02:57) High Reliability Seems Unattainable

(05:12) Why Won't Interpretability be Reliable?

(07:47) The Potential of Black-Box Methods

(08:48) The Role of Interpretability

(12:02) Conclusion

The original text contained 5 footnotes which were omitted from this narration.

---

First published:

May 4th, 2025

Source:

https://www.lesswrong.com/posts/PwnadG4BFjaER3MGf/interpretability-will-not-reliably-find-deceptive-ai

---

Narrated by TYPE III AUDIO.

...more
View all episodesView all episodes
Download on the App Store

LessWrong (30+ Karma)By LessWrong


More shows like LessWrong (30+ Karma)

View all
The Daily by The New York Times

The Daily

112,664 Listeners

Astral Codex Ten Podcast by Jeremiah

Astral Codex Ten Podcast

130 Listeners

Interesting Times with Ross Douthat by New York Times Opinion

Interesting Times with Ross Douthat

7,216 Listeners

Dwarkesh Podcast by Dwarkesh Patel

Dwarkesh Podcast

530 Listeners

The Ezra Klein Show by New York Times Opinion

The Ezra Klein Show

16,132 Listeners

AI Article Readings by Readings of great articles in AI voices

AI Article Readings

4 Listeners

Doom Debates by Liron Shapira

Doom Debates

14 Listeners

LessWrong posts by zvi by zvi

LessWrong posts by zvi

2 Listeners