June 18, 2023

Neel Nanda - Mechanistic Interpretability

4 hours 10 minutes

In this wide-ranging conversation, Tim Scarfe interviews Neel Nanda, a researcher at DeepMind working on mechanistic interpretability, which aims to understand the algorithms and representations learned by machine learning models. Neel discusses how models can represent their thoughts using motifs, circuits, and linear directional features which are often communicated via a "residual stream", an information highway models use to pass information between layers.

Neel argues that "superposition", the ability for models to represent more features than they have neurons, is one of the biggest open problems in interpretability. This is because superposition thwarts our ability to understand models by decomposing them into individual units of analysis. Despite this, Neel remains optimistic that ambitious interpretability is possible, citing examples like his work reverse engineering how models do modular addition. However, Neel notes we must start small, build rigorous foundations, and not assume our theoretical frameworks perfectly match reality.

The conversation turns to whether models can have goals or agency, with Neel arguing they likely can based on heuristics like models executing long term plans towards some objective. However, we currently lack techniques to build models with specific goals, meaning any goals would likely be learned or emergent. Neel highlights how induction heads, circuits models use to track long range dependencies, seem crucial for phenomena like in-context learning to emerge.

On the existential risks from AI, Neel believes we should avoid overly confident claims that models will or will not be dangerous, as we do not understand them enough to make confident theoretical assertions. However, models could pose risks through being misused, having undesirable emergent properties, or being imperfectly aligned. Neel argues we must pursue rigorous empirical work to better understand and ensure model safety, avoid "philosophizing" about definitions of intelligence, and focus on ensuring researchers have standards for what it means to decide a system is "safe" before deploying it. Overall, a thoughtful conversation on one of the most important issues of our time.

Support us! https://www.patreon.com/mlst

MLST Discord: https://discord.gg/aNPkGUQtc5

Twitter: https://twitter.com/MLStreetTalk

Neel Nanda: https://www.neelnanda.io/

TOC

[00:00:00] Introduction and Neel Nanda's Interests (walk and talk)

[00:03:15] Mechanistic Interpretability: Reverse Engineering Neural Networks

[00:13:23] Discord questions

[00:21:16] Main interview kick-off in studio

[00:49:26] Grokking and Sudden Generalization

[00:53:18] The Debate on Systematicity and Compositionality

[01:19:16] How do ML models represent their thoughts

[01:25:51] Do Large Language Models Learn World Models?

[01:53:36] Superposition and Interference in Language Models

[02:43:15] Transformers discussion

[02:49:49] Emergence and In-Context Learning

[03:20:02] Superintelligence/XRisk discussion

Transcript: https://docs.google.com/document/d/1FK1OepdJMrqpFK-_1Q3LQN6QLyLBvBwWW_5z8WrS1RI/edit?usp=sharing

Refs: https://docs.google.com/document/d/115dAroX0PzSduKr5F1V4CWggYcqIoSXYBhcxYktCnqY/edit?usp=sharing

...more

View all episodes

By Machine Learning Street Talk (MLST)

4.6

9595 ratings

June 18, 2023

Neel Nanda - Mechanistic Interpretability

4 hours 10 minutes

Support us! https://www.patreon.com/mlst

MLST Discord: https://discord.gg/aNPkGUQtc5

Twitter: https://twitter.com/MLStreetTalk

Neel Nanda: https://www.neelnanda.io/

TOC

[00:00:00] Introduction and Neel Nanda's Interests (walk and talk)

[00:03:15] Mechanistic Interpretability: Reverse Engineering Neural Networks

[00:13:23] Discord questions

[00:21:16] Main interview kick-off in studio

[00:49:26] Grokking and Sudden Generalization

[00:53:18] The Debate on Systematicity and Compositionality

[01:19:16] How do ML models represent their thoughts

[01:25:51] Do Large Language Models Learn World Models?

[01:53:36] Superposition and Interference in Language Models

[02:43:15] Transformers discussion

[02:49:49] Emergence and In-Context Learning

[03:20:02] Superintelligence/XRisk discussion

Transcript: https://docs.google.com/document/d/1FK1OepdJMrqpFK-_1Q3LQN6QLyLBvBwWW_5z8WrS1RI/edit?usp=sharing

Refs: https://docs.google.com/document/d/115dAroX0PzSduKr5F1V4CWggYcqIoSXYBhcxYktCnqY/edit?usp=sharing

...more

More shows like Machine Learning Street Talk (MLST)

View all

The a16z Show

1,093 Listeners

The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

436 Listeners

Super Data Science: ML & AI Podcast with Jon Krohn

301 Listeners

NVIDIA AI Podcast

345 Listeners

Practical AI

208 Listeners

Google DeepMind: The Podcast

202 Listeners

Last Week in AI

314 Listeners

Dwarkesh Podcast

576 Listeners

Big Technology Podcast

508 Listeners

No Priors: Artificial Intelligence | Technology | Startups

143 Listeners

Latent Space: The AI Engineer Podcast

101 Listeners

This Day in AI Podcast

226 Listeners

The AI Daily Brief: Artificial Intelligence News and Analysis

682 Listeners

BG2Pod with Brad Gerstner and Bill Gurley

491 Listeners

AI + a16z

34 Listeners

Share Neel Nanda - Mechanistic Interpretability

Sign up to save your podcasts

Neel Nanda - Mechanistic Interpretability

Neel Nanda - Mechanistic Interpretability

More shows like Machine Learning Street Talk (MLST)

The a16z Show

The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

Super Data Science: ML & AI Podcast with Jon Krohn

NVIDIA AI Podcast

Practical AI

Google DeepMind: The Podcast

Last Week in AI

Dwarkesh Podcast

Big Technology Podcast

No Priors: Artificial Intelligence | Technology | Startups

Latent Space: The AI Engineer Podcast

This Day in AI Podcast

The AI Daily Brief: Artificial Intelligence News and Analysis

BG2Pod with Brad Gerstner and Bill Gurley

AI + a16z