


This research explores mechanistic interpretability by tracing the internal computations of large language models such as Llama 3.1 within their original neuron basis. The authors introduce RelP, a gradient-based attribution method that uses linear approximations to identify causal circuits more efficiently than traditional techniques. Their findings suggest that individual MLP neurons can represent interpretable, monosemantic features without the computational burden or approximation errors associated with sparse autoencoders. Across various benchmarks, the study demonstrates that these neuron-level circuits are both faithful to model behavior and highly sparse. Practical applications are shown through steering experiments and the discovery of specialized circuits for tasks such as multi-hop reasoning and multilingual translation. Ultimately, the work advocates exhausting the interpretability of the original neuron basis as a robust alternative to learned dictionary methods.
By Enoch H. Kang
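The summary does not spell out how RelP computes its attribution scores, so the snippet below is only a minimal sketch of the general idea it alludes to: approximating the causal effect of patching individual MLP neurons with a first-order, gradient-times-activation-difference term. The toy block, metric, and variable names are illustrative assumptions, not the authors' implementation or an interface to Llama 3.1.

```python
# Hedged sketch: neuron-level attribution via a linear (first-order)
# approximation of activation patching. Everything here (toy MLP, metric,
# inputs) is an assumption standing in for a real LLM such as Llama 3.1;
# it is not the RelP algorithm itself.
import torch
import torch.nn as nn

torch.manual_seed(0)

class ToyMLPBlock(nn.Module):
    """Stand-in for one transformer MLP block whose hidden neurons we score."""
    def __init__(self, d_model: int = 16, d_hidden: int = 64):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.act = nn.GELU()
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        self.hidden = self.act(self.up(x))     # cache per-neuron activations
        if self.hidden.requires_grad:
            self.hidden.retain_grad()          # keep gradients for attribution
        return self.down(self.hidden)

model = ToyMLPBlock()
metric = lambda out: out.sum()                 # placeholder for e.g. a logit difference

clean_x, corrupt_x = torch.randn(1, 16), torch.randn(1, 16)

# Forward/backward on the clean input: activations and metric gradients.
metric(model(clean_x)).backward()
clean_act = model.hidden.detach().clone()
grad = model.hidden.grad.detach().clone()

# Forward on the corrupted input: counterfactual activations only.
with torch.no_grad():
    model(corrupt_x)
corrupt_act = model.hidden.detach().clone()

# Linear approximation of the effect of patching each neuron:
# effect_i ~ (corrupt_act_i - clean_act_i) * d(metric)/d(clean_act_i)
attribution = (corrupt_act - clean_act) * grad

top = attribution.abs().squeeze(0).topk(5)
print("Top candidate circuit neurons:", top.indices.tolist())
```

Neurons with the largest absolute scores would be candidates for the sparse, faithful circuits the episode describes; in practice, such candidates would need to be verified with actual activation patching or steering experiments like those mentioned above.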