Many thanks to Michael Hanna and Joshua Batson for useful feedback and discussion. Kat Dearstyne and Kamal Maher conducted experiments during the SPAR Fall 2025 Cohort.
TL;DR
Cross-layer transcoders (CLTs) enable circuit tracing that can extract high-level mechanistic explanations for arbitrary prompts and are emerging as general-purpose infrastructure for mechanistic interpretability. Because these tools operate at a relatively low level, their outputs are often treated as reliable descriptions of what a model is doing, not just predictive approximations. We therefore ask: when are CLT-derived circuits faithful to the model's true internal computation?
In a Boolean toy model with known ground truth, we show a specific unfaithfulness mode: CLTs can rewrite deep multi-hop circuits into sums of shallow single-hop circuits, yielding explanations that match behavior while obscuring the actual computational pathway. Moreover, we find that widely used sparsity penalties can incentivize this rewrite, pushing CLTs toward unfaithful decompositions. We then provide preliminary evidence that similar discrepancies arise in real language models, where per-layer transcoders and cross-layer transcoders sometimes imply sharply different circuit-level interpretations for the same behavior. Our results clarify a limitation of CLT-based circuit tracing and motivate care in how sparsity and interpretability objectives are chosen.
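As a minimal illustration of this rewrite (a hand-constructed sketch, not the trained toy model or CLT from the experiments below), consider a two-layer Boolean model that computes NOT(AND(a, b)) in two hops. All feature definitions and decoder weights in the sketch are assumptions chosen for clarity: a per-layer-style decomposition uses one feature per computational step, while a cross-layer-style decomposition reconstructs both layers' outputs using only features that read from layer 0, so the output is explained in a single hop and the intermediate AND step drops out of the traced circuit.

```python
# Minimal, hand-constructed sketch of the multi-hop -> single-hop rewrite.
# Illustrative assumptions only; this is not the paper's toy model or a trained CLT.
import numpy as np

# Ground-truth model: two MLP "layers" computing y = NOT(AND(a, b)).
# Layer 1: h = ReLU(a + b - 1)   (fires iff a AND b)
# Layer 2: y = ReLU(1 - h)       (fires iff NOT(a AND b))
def layer1(x):                    # x has columns [a, b]
    return np.maximum(x @ np.array([[1.0], [1.0]]) - 1.0, 0.0)

def layer2(h):
    return np.maximum(1.0 - h, 0.0)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
H = layer1(X)                     # layer-1 reconstruction target
Y = layer2(H)                     # layer-2 reconstruction target

# "Faithful" per-layer-style decomposition: one feature per computational step.
f1 = np.maximum(X @ np.array([1.0, 1.0]) - 1.0, 0.0)   # reads layer 0, encodes AND(a, b)
f2 = np.maximum(1.0 - H[:, 0], 0.0)                    # reads layer 1, encodes NOT(h)
recon_L1_faithful = f1[:, None]                        # f1 decodes into layer 1
recon_L2_faithful = f2[:, None]                        # f2 decodes into layer 2
# Implied circuit: input -> f1 -> f2 -> output (two hops, matching the model).

# Cross-layer-style rewrite: features reading layer 0 decode into BOTH layers,
# re-implementing the intermediate AND inside the transcoder's encoder.
g = np.maximum(X @ np.array([1.0, 1.0]) - 1.0, 0.0)    # AND(a, b), read from layer 0
always_on = np.ones(len(X))                            # bias-like always-on feature
recon_L1_clt = g[:, None]                              # decode +1 * g into layer 1
recon_L2_clt = always_on[:, None] - g[:, None]         # decode 1 - g into layer 2
# Implied circuit: the layer-2 output is attributed entirely to features that
# read layer 0, so the intermediate layer-1 step vanishes from the trace.

for name, recon, target in [
    ("faithful L1", recon_L1_faithful, H), ("faithful L2", recon_L2_faithful, Y),
    ("cross-layer L1", recon_L1_clt, H), ("cross-layer L2", recon_L2_clt, Y),
]:
    print(name, "max reconstruction error:", np.abs(recon - target).max())
```

Both decompositions reconstruct the activations exactly; they differ only in which layer their active features read from, and that is precisely what a circuit trace reports.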
Introduction
In this [...]
---
Outline:
(00:26) TL;DR
(01:46) Introduction
(03:31) Background
(06:58) Crosslayer Superposition vs Multi-step Circuits
(09:05) CLTs Are Not Faithful in a Toy Model
(09:29) Toy Model Overview
(10:46) Cross-Layer Transcoder Training
(11:46) The CLT Skips Intermediate Computation
(14:33) Other Results
(15:46) The CLT's Loss Function Incentivizes this Bias
(16:24) Intuition
(18:32) Evidence from Real Language Models
(18:52) Asymmetric L0 Across Layers
(20:57) CLT Circuits Attribute to Earlier Layers
(23:28) Divergent Circuit Interpretations: License Memorization
(23:59) In the Wild: A Real Feature the CLT Doesn't Learn
(27:07) How CLTs Lead to the Wrong Story
(31:02) Discussion
(31:05) The Crosslayer Superposition Tradeoff
(32:31) Detection and Mitigation
(32:42) Detection approaches
(34:21) Mitigation approaches
(37:26) Limitations
(38:58) Conclusion
(39:55) Appendix
(39:58) Contributions
(40:37) Full Analysis of JumpReLU CLT (L0=2)
(42:32) Full Analysis of JumpReLU PLT (L0=2)
(44:08) Figure: Two heatmaps comparing Reconstruction (layer 0) and Target (layer 0) across samples and input values.
(44:21) Figure: Heatmaps comparing Reconstruction (layer 1) and Target (layer 1) across samples and inputs.
---