December 05, 2025

Circuit Tracing: Attribution Graphs and the Grammar of Neural Networks

57 minutes

This episode explores how Anthropic researchers successfully scaled sparse autoencoders from toy models to Claude 3 Sonnet's 8 billion neurons, extracting 34 million interpretable features including ones for deception, sycophancy, and the famous Golden Gate Bridge example. The discussion emphasizes both the breakthrough achievement of making interpretability techniques work at production scale and the sobering limitations including 65% reconstruction accuracy, millions of dollars in compute costs, and the growing gap between interpretability research and rapid advances in model capabilities.

Credits

Cover Art by Brianna Williams
TMOM Intro Music by Danny Meza

A special thank you to these talented artists for their contributions to the show.

Links and Reference

Academic Papers

Circuit Tracing: Revealing Computational Graphs in Language Models - Anthropic (Mar, 2025)
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning - Anthropic (Oct, 2023)
“Toy Models of Superposition” - Anthropic (December 2022)
"Alignment Faking in Large Language Models" - Anthropic (December 2024)
"Agentic Misalignment: How LLMs Could Be Insider Threats" - Anthropic (January 2025)
"Attention is All You Need" - Vaswani, et al (June, 2017)
In-Context Learning and Induction Heads - Anthropic (March 2022)

News

Anthropic Project Fetch / Robot Dogs
Anduril's Fury unmanned fighter jet
MIT search and rescue robot navigation

Abandoned Episode Titles

“Westworld But It's Just 10 Terabytes of RAM Trying to Understand Haiku”
“Star Trek: The Wrath of O(n⁴)”
“The Deception Is Coming From Inside the Network”
"We Have the Bestest Circuits”
“Lobotomy Validation: The Funnier, More Scientifically Sound Term”
“Seven San Franciscos Worth of Power and All We Got Was This Attribution Graph”

...more

View all episodes

By John Jezl and Jon Rocha

December 05, 2025

Circuit Tracing: Attribution Graphs and the Grammar of Neural Networks

57 minutes

Credits

Cover Art by Brianna Williams
TMOM Intro Music by Danny Meza

A special thank you to these talented artists for their contributions to the show.

Links and Reference

Academic Papers

Circuit Tracing: Revealing Computational Graphs in Language Models - Anthropic (Mar, 2025)
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning - Anthropic (Oct, 2023)
“Toy Models of Superposition” - Anthropic (December 2022)
"Alignment Faking in Large Language Models" - Anthropic (December 2024)
"Agentic Misalignment: How LLMs Could Be Insider Threats" - Anthropic (January 2025)
"Attention is All You Need" - Vaswani, et al (June, 2017)
In-Context Learning and Induction Heads - Anthropic (March 2022)

News

Anthropic Project Fetch / Robot Dogs
Anduril's Fury unmanned fighter jet
MIT search and rescue robot navigation

Abandoned Episode Titles

“Westworld But It's Just 10 Terabytes of RAM Trying to Understand Haiku”
“Star Trek: The Wrath of O(n⁴)”
“The Deception Is Coming From Inside the Network”
"We Have the Bestest Circuits”
“Lobotomy Validation: The Funnier, More Scientifically Sound Term”
“Seven San Franciscos Worth of Power and All We Got Was This Attribution Graph”

...more

Share Circuit Tracing: Attribution Graphs and the Grammar of Neural Networks

Sign up to save your podcasts

Circuit Tracing: Attribution Graphs and the Grammar of Neural Networks

Circuit Tracing: Attribution Graphs and the Grammar of Neural Networks