AXRP - the AI X-risk Research Podcast

19 - Mechanistic Interpretability with Neel Nanda



How good are we at understanding the internal computation of advanced machine learning models, and do we have any hope of getting better? In this episode, Neel Nanda talks about the sub-field of mechanistic interpretability research, as well as papers he's contributed to that explore the basics of transformer circuits, induction heads, and grokking.

 

Topics we discuss, and timestamps:

 - 00:01:05 - What is mechanistic interpretability?

 - 00:24:16 - Types of AI cognition

 - 00:54:27 - Automating mechanistic interpretability

 - 01:11:57 - Summarizing the papers

 - 01:24:43 - 'A Mathematical Framework for Transformer Circuits'

   - 01:39:31 - How attention works

   - 01:49:26 - Composing attention heads

   - 01:59:42 - Induction heads

 - 02:11:05 - 'In-context Learning and Induction Heads'

   - 02:12:55 - The multiplicity of induction heads

   - 02:30:10 - Lines of evidence

   - 02:38:47 - Evolution in loss-space

   - 02:46:19 - Mysteries of in-context learning

 - 02:50:57 - 'Progress measures for grokking via mechanistic interpretability'

   - 02:50:57 - How neural nets learn modular addition

   - 03:11:37 - The suddenness of grokking

 - 03:34:16 - Relation to other research

 - 03:43:57 - Could mechanistic interpretability possibly work?

 - 03:49:28 - Following Neel's research

 

The transcript: axrp.net/episode/2023/02/04/episode-19-mechanistic-interpretability-neel-nanda.html

 

Links to Neel's things:

 - Neel on Twitter: twitter.com/NeelNanda5

 - Neel on the Alignment Forum: alignmentforum.org/users/neel-nanda-1

 - Neel's mechanistic interpretability blog: neelnanda.io/mechanistic-interpretability

 - TransformerLens: github.com/neelnanda-io/TransformerLens

 - Concrete Steps to Get Started in Transformer Mechanistic Interpretability: alignmentforum.org/posts/9ezkEb9oGvEi6WoB3/concrete-steps-to-get-started-in-transformer-mechanistic

 - Neel on YouTube: youtube.com/@neelnanda2469

 - 200 Concrete Open Problems in Mechanistic Interpretability: alignmentforum.org/s/yivyHaCAmMJ3CqSyj

 - Comprehensive mechanistic interpretability explainer: dynalist.io/d/n2ZWtnoYHrU1s4vnFSAQ519J

 

Writings we discuss:

 - A Mathematical Framework for Transformer Circuits: transformer-circuits.pub/2021/framework/index.html

 - In-context Learning and Induction Heads: transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html

 - Progress measures for grokking via mechanistic interpretability: arxiv.org/abs/2301.05217

 - Hungry Hungry Hippos: Towards Language Modeling with State Space Models (referred to in this episode as the "S4 paper"): arxiv.org/abs/2212.14052

 - interpreting GPT: the logit lens: lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens

 - Locating and Editing Factual Associations in GPT (aka the ROME paper): arxiv.org/abs/2202.05262

 - Human-level play in the game of Diplomacy by combining language models with strategic reasoning: science.org/doi/10.1126/science.ade9097

 - Causal Scrubbing: alignmentforum.org/s/h95ayYYwMebGEYN5y/p/JvZhhzycHu2Yd57RN

 - An Interpretability Illusion for BERT: arxiv.org/abs/2104.07143

 - Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small: arxiv.org/abs/2211.00593

 - Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets: arxiv.org/abs/2201.02177

 - The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models: arxiv.org/abs/2201.03544

 - Collaboration & Credit Principles: colah.github.io/posts/2019-05-Collaboration

 - Transformer Feed-Forward Layers Are Key-Value Memories: arxiv.org/abs/2012.14913

 - Multi-Component Learning and S-Curves: alignmentforum.org/posts/RKDQCB6smLWgs2Mhr/multi-component-learning-and-s-curves

 - The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks: arxiv.org/abs/1803.03635

 - Linear Mode Connectivity and the Lottery Ticket Hypothesis: proceedings.mlr.press/v119/frankle20a

 

 

By Daniel Filan
