AI Post Transformers

Advancing Mechanistic Interpretability with Sparse Autoencoders



We review recent papers on advances in, and critical uses of, Sparse Autoencoders (SAEs): tools for decoding the internal "monosemantic" features of large language models. Research from ICLR 2025 and other venues introduces TopK SAEs and Multi-Layer SAEs, demonstrating that these architectures offer superior reconstruction quality and scalability compared to traditional ReLU-based models. RouteSAE further improves efficiency by using a dynamic routing mechanism to extract integrated features from across multiple layers of a model's residual stream. However, critical analysis reveals that many identified "reasoning" features may actually be linguistic correlates or syntactic templates rather than genuine cognitive traces. Using falsification frameworks and causal token injection, researchers caution against over-interpreting feature activations without rigorous validation. Together, these papers provide a technical foundation for mechanistic interpretability, balancing new architectural breakthroughs with a skeptical look at current evaluation metrics.

Sources:
1) Residual Stream Analysis with Multi-Layer SAEs (2025). Tim Lawson. https://arxiv.org/abs/2409.04185
2) AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders (2025). Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher Manning, Christopher Potts. https://openreview.net/forum?id=XAjfjizaKs
3) SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability (2025). Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Callum McDougall, Kola Ayonrinde, Demian Till, Matthew Wearden, Arthur Conmy, Samuel Marks, Neel Nanda. www.neuronpedia.org/sae-bench
4) Toward Efficient Sparse Autoencoder-Guided Steering for Improved In-Context Learning in Large Language Models (2025). Ikhyun Cho, Julia Hockenmaier (University of Illinois at Urbana-Champaign). https://aclanthology.org/2025.emnlp-main.1474.pdf
5) Route Sparse Autoencoder to Interpret Large Language Models (2025). Wei Shi, Sihang Li, Tao Liang, Mingyang Wan, Guojun Ma, Xiang Wang, Xiangnan He (University of Science and Technology of China; Douyin Co., Ltd.). https://aclanthology.org/2025.emnlp-main.346.pdf
6) Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models (2025). Aashiq Muhamed, Mona Diab, Virginia Smith (Carnegie Mellon University). https://aclanthology.org/2025.findings-naacl.87.pdf
7) Falsifying Sparse Autoencoder Reasoning Features in Language Models (February 10, 2026). George Ma, Zhongyuan Liang, Irene Y. Chen, Somayeh Sojoudi (UC Berkeley, UCSF). https://arxiv.org/pdf/2601.05679
8) Sparse But Wrong: Incorrect L0 Leads to Incorrect Features in Sparse Autoencoders (under review). Anonymous authors. https://openreview.net/pdf/035a5937c6a536c67b5999aa43e53dd3800ba3a4.pdf
9) Revising and Falsifying Sparse Autoencoder Feature Explanations (2025). George Ma, Samuel Pfrommer, Somayeh Sojoudi (University of California, Berkeley). https://openreview.net/pdf?id=OJAW2mHVND
10) Scaling and Evaluating Sparse Autoencoders (2025). Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, Jeffrey Wu (OpenAI). https://proceedings.iclr.cc/paper_files/paper/2025/file/42ef3308c230942d223c411adf182c88-Paper-Conference.pdf
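The core idea behind the TopK SAEs discussed above is simple: instead of a ReLU plus an L1 sparsity penalty, the encoder keeps only the k largest latent pre-activations per input and zeroes the rest, enforcing sparsity directly. The following is a minimal NumPy sketch of that forward pass; the function name, weight shapes, and random demo values are illustrative assumptions, not the implementation from any of the papers listed.

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """One forward pass of a TopK sparse autoencoder (illustrative sketch).

    x: (batch, d_model) residual-stream activations
    Returns the sparse latent codes z and the reconstruction of x.
    """
    pre = x @ W_enc + b_enc                 # (batch, n_latents) pre-activations
    # Zero out everything except the k largest pre-activations per example.
    smallest = np.argsort(pre, axis=-1)[:, :-k]   # indices of the n_latents - k smallest
    z = pre.copy()
    np.put_along_axis(z, smallest, 0.0, axis=-1)  # hard TopK sparsity
    recon = z @ W_dec + b_dec               # (batch, d_model) reconstruction
    return z, recon

# Tiny demo with random weights (d_model=4, n_latents=8, k=2); shapes only.
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4))
W_enc = rng.normal(size=(4, 8)); b_enc = np.zeros(8)
W_dec = rng.normal(size=(8, 4)); b_dec = np.zeros(4)
z, recon = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k=2)
```

Because at most k latents are nonzero per input, there is no L1 term to tune during training; the reconstruction loss alone drives the dictionary, which is one reason these models scale more cleanly than ReLU SAEs.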

AI Post Transformers, by mcgrof