October 06, 2024

“(Maybe) A Bag of Heuristics is All There Is & A Bag of Heuristics is All You Need” by Sodium

28 minutes

Epistemic status: Theorizing on topics I’m not qualified for. Trying my best to be truth-seeking instead of hyping up my idea. Not much here is original, but hopefully the combination is useful. This hypothesis deserves more time and consideration but I’m sharing this minimal version to get some feedback before sinking more time into it. “We believe there's a lot of value in articulating a strong version of something one may believe to be true, even if it might be false.”

The Heuristics Hypothesis: A Bag of Heuristics is All There Is and a Bag of Heuristics is All You Need

A heuristic is a local, interpretable, and simple function (e.g., boolean/arithmetic/lookup functions) learned from the training data. There are multiple heuristics in each layer and their outputs are used in later layers.
- It would be useful to treat heuristics as the fundamental object of study in [...]
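To make the definition above concrete, here is a toy sketch in Python. It is a hypothetical illustration, not the author's method. In a real network, heuristics would be implicit in learned weights. Here, the heuristics, their names (`is_large`, `sum_ab`, `color_score`, `big_and_bright`, `sum_is_even`), and the two-layer layout are hand-written assumptions. The sketch shows boolean, arithmetic, and lookup heuristics sitting together in one layer, with heuristics in a later layer reading their outputs.

```python
# Toy, hypothetical "bag of heuristics", hand-written for illustration only.
# Real networks would learn such functions implicitly in their weights.
from typing import Callable, Dict

Values = Dict[str, float]
Heuristic = Callable[[Values], float]

# Layer 1: simple, local heuristics that each read only a few raw inputs.
LAYER1: Dict[str, Heuristic] = {
    # boolean heuristic: is input "a" large?
    "is_large": lambda x: 1.0 if x["a"] > 10 else 0.0,
    # arithmetic heuristic: add two inputs
    "sum_ab": lambda x: x["a"] + x["b"],
    # lookup heuristic: map a categorical code to a score (unknown codes -> 0.0)
    "color_score": lambda x: {0: 0.1, 1: 0.9}.get(int(x["c"]), 0.0),
}

# Layer 2: heuristics that consume the outputs of layer-1 heuristics.
LAYER2: Dict[str, Heuristic] = {
    "big_and_bright": lambda h: h["is_large"] * h["color_score"],
    "sum_is_even": lambda h: 1.0 if int(h["sum_ab"]) % 2 == 0 else 0.0,
}


def run_layer(layer: Dict[str, Heuristic], inputs: Values) -> Values:
    """Apply every heuristic in a layer to the same inputs."""
    return {name: f(inputs) for name, f in layer.items()}


def forward(x: Values) -> float:
    """Run both layers, then combine the layer-2 outputs into one score."""
    h1 = run_layer(LAYER1, x)
    h2 = run_layer(LAYER2, h1)
    return h2["big_and_bright"] + 0.5 * h2["sum_is_even"]


print(forward({"a": 12, "b": 4, "c": 1}))
```

With the input above, this prints 1.4: `big_and_bright` contributes 0.9 and `sum_is_even` contributes 0.5. Because each heuristic is a small named function, an explanation of the output can be read off by listing which heuristics fired.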

---

Outline:

(00:37) The Heuristics Hypothesis: A Bag of Heuristics is All There Is and a Bag of Heuristics is All You Need

(02:07) Why would you want to use the heuristics-based framework when thinking about neural networks?

(04:01) How can interpretability win if the hypothesis is true?

(05:08) Corollary: Understanding neural network computation does not require us to learn “true features” as long as we have some set of faithful, complete, minimal, and comprehensible heuristics

(06:33) Weak to strong winning

(09:01) Miscellaneous thoughts on interpretability with the heuristics hypothesis

(11:34) What does it mean for alignment theory if the heuristics hypothesis is true?

(13:23) Empirical studies related to the heuristics hypothesis (both in support and against)

(18:00) Weaknesses in the Heuristics Hypothesis

(18:04) Some versions of the hypothesis are unfalsifiable

(18:39) The current features-focused research agendas might be the best way to uncover heuristics, and we don’t actually need to do anything different regardless of how true the heuristics hypothesis is.

(19:54) Getting heuristics that are causally related to a specific output does not necessarily help monitor a model's internal thoughts.

(20:16) Inspirations and related work that I haven’t already mentioned

(22:59) Potential next steps

(23:21) Deconfusion: What exactly is a heuristic, and what does a heuristics-based explanation look like?

(23:53) Creating new interpretability methods that are centered around heuristics as the fundamental unit

(24:44) Using existing interpretability tools to discover heuristics

(26:37) Applying the heuristics framework to study theoretical questions in alignment.

The original text contained 6 footnotes which were omitted from this narration.

---

First published:

October 3rd, 2024

Source:

https://www.lesswrong.com/posts/azCuKGJTecCDLBEEj/maybe-a-bag-of-heuristics-is-all-there-is-and-a-bag-of

---

Narrated by TYPE III AUDIO.

LessWrong (30+ Karma)

By LessWrong