This summary draws on two related Anthropic publications from March 27, 2025 concerning the interpretability of large language models, with a focus on Claude 3.5 Haiku. The core objective is to reverse engineer the internal computational mechanisms, or "circuits," that drive the model's behavior, in a manner analogous to studying biology or neuroscience. The research introduces a circuit-tracing methodology that uses attribution graphs and feature analysis to examine how the model handles a variety of tasks, including multi-step reasoning, planning in poems, multilingual translation, and arithmetic. The findings reveal sophisticated strategies such as internal planning, as well as "default" refusal circuits that must be inhibited by "known answer" features before the model will respond to a question, illuminating the mechanisms behind hallucinations and jailbreaks.

Sources:
https://transformer-circuits.pub/2025/attribution-graphs/biology.html
https://www.anthropic.com/research/tracing-thoughts-language-model
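The refusal mechanism described above can be illustrated with a minimal toy sketch. All feature names, activations, and weights below are hypothetical placeholders chosen for illustration; they are not taken from the papers, which identify features empirically inside the trained model. The idea is simply that a refusal feature is active at a positive baseline and gets inhibited in proportion to a "known answer" feature's activation.

```python
# Toy illustration of a "default refusal" circuit inhibited by a
# "known answer" feature. Hypothetical values, not Anthropic's actual features.

def refuses(known_answer_activation: float,
            inhibition_weight: float = 1.5,
            default_refusal: float = 1.0) -> bool:
    """Return True if the toy circuit ends up refusing to answer.

    The refusal feature starts at a positive baseline (active by default)
    and is suppressed in proportion to the known-answer activation.
    """
    refusal_activation = default_refusal - inhibition_weight * known_answer_activation
    return refusal_activation > 0.0

# Familiar entity: the "known answer" feature fires strongly,
# inhibiting the default refusal, so the model answers.
print(refuses(known_answer_activation=0.9))  # False -> answers

# Unknown entity: the "known answer" feature stays quiet,
# so the default refusal wins and the model declines.
print(refuses(known_answer_activation=0.1))  # True -> refuses
```

On this picture, a hallucination corresponds to the "known answer" feature misfiring for an entity the model does not actually know, suppressing the refusal that should have occurred.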