


This research explores mechanistic interpretability by tracing the internal computations of large language models such as Llama 3.1 within their original neuron basis. The authors introduce RelP, a gradient-based attribution method that uses linear approximations to identify causal circuits more efficiently than traditional techniques. Their findings suggest that individual MLP neurons can represent interpretable, monosemantic features without the computational burden or approximation errors associated with sparse autoencoders. Across various benchmarks, the study demonstrates that these neuron-level circuits are both faithful to model behavior and highly sparse. Practical applications are shown through steering experiments and the discovery of specialized circuits for tasks such as multi-hop reasoning and multilingual translation. Ultimately, the work advocates exhausting the interpretability of the original neuron basis as a robust alternative to learned dictionary methods.
By Enoch H. Kang
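The summary does not spell out how RelP computes its attribution scores, so the snippet below is only a minimal sketch of the general idea it alludes to: approximating the causal effect of patching individual MLP neurons with a first-order, gradient-times-activation-difference term. The toy block, metric, and variable names are illustrative assumptions, not the authors' implementation or an interface to Llama 3.1.

```python
# Hedged sketch: neuron-level attribution via a linear (first-order)
# approximation of activation patching. Everything here (toy MLP, metric,
# inputs) is an assumption standing in for a real LLM such as Llama 3.1;
# it is not the RelP algorithm itself.
import torch
import torch.nn as nn

torch.manual_seed(0)

class ToyMLPBlock(nn.Module):
    """Stand-in for one transformer MLP block whose hidden neurons we score."""
    def __init__(self, d_model: int = 16, d_hidden: int = 64):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.act = nn.GELU()
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        self.hidden = self.act(self.up(x))     # cache per-neuron activations
        if self.hidden.requires_grad:
            self.hidden.retain_grad()          # keep gradients for attribution
        return self.down(self.hidden)

model = ToyMLPBlock()
metric = lambda out: out.sum()                 # placeholder for e.g. a logit difference

clean_x, corrupt_x = torch.randn(1, 16), torch.randn(1, 16)

# Forward/backward on the clean input: activations and metric gradients.
metric(model(clean_x)).backward()
clean_act = model.hidden.detach().clone()
grad = model.hidden.grad.detach().clone()

# Forward on the corrupted input: counterfactual activations only.
with torch.no_grad():
    model(corrupt_x)
corrupt_act = model.hidden.detach().clone()

# Linear approximation of the effect of patching each neuron:
# effect_i ~ (corrupt_act_i - clean_act_i) * d(metric)/d(clean_act_i)
attribution = (corrupt_act - clean_act) * grad

top = attribution.abs().squeeze(0).topk(5)
print("Top candidate circuit neurons:", top.indices.tolist())
```

Neurons with the largest absolute scores would be candidates for the sparse, faithful circuits the episode describes; in practice, such candidates would need to be verified with actual activation patching or steering experiments like those mentioned above.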