
Epistemic status: Theorizing on topics I’m not qualified for. Trying my best to be truth-seeking instead of hyping up my idea. Not much here is original, but hopefully the combination is useful. This hypothesis deserves more time and consideration but I’m sharing this minimal version to get some feedback before sinking more time into it. “We believe there's a lot of value in articulating a strong version of something one may believe to be true, even if it might be false.”
The Heuristics Hypothesis: A Bag of Heuristics is All There Is and a Bag of Heuristics is All You Need
---
Outline:
(00:37) The Heuristics Hypothesis: A Bag of Heuristics is All There Is and a Bag of Heuristics is All You Need
(02:07) Why would you want to use the heuristics-based framework when thinking about neural networks?
(04:01) How can interpretability win if the hypothesis is true?
(05:08) Corollary: Understanding neural network computation does not require us to learn “true features” as long as we have some set of faithful, complete, minimal, and comprehensible heuristics
(06:33) Weak to strong winning
(09:01) Miscellaneous thoughts on interpretability with heuristics hypothesis
(11:34) What does it mean for alignment theory if the heuristics hypothesis is true?
(13:23) Empirical studies related to the heuristics hypothesis (both in support and against)
(18:00) Weaknesses in the Heuristics Hypothesis
(18:04) Some versions of the hypothesis are unfalsifiable
(18:39) The current features-focused research agendas might be the best way to uncover heuristics, and we don’t actually need to do anything different regardless of how true the heuristics hypothesis is.
(19:54) Getting heuristics that are causally related to a specific output does not necessarily help monitor a model's internal thoughts.
(20:16) Inspirations and related work that I haven’t already mentioned
(22:59) Potential next steps
(23:21) Deconfusion: What exactly is a heuristic, and what does a heuristics-based explanation look like?
(23:53) Creating new interpretability methods that are centered around heuristics as the fundamental unit
(24:44) Using existing interpretability tools to discover heuristics
(26:37) Applying the heuristics framework to study theoretical questions in alignment.
The original text contained 6 footnotes which were omitted from this narration.
---
First published:
Source:
Narrated by TYPE III AUDIO.