Best AI papers explained

Identifiable Steering via Sparse Autoencoding of Multi-Concept Shifts


We introduce Sparse Shift Autoencoders (SSAEs), a novel method for learning to steer Large Language Models (LLMs) by manipulating their internal representations. Unlike traditional steering techniques, which rely on expensive supervised data in which only a single concept varies, SSAEs are designed to learn from paired observations where multiple unknown concepts change simultaneously. By mapping these embedding differences to sparse representations that correspond to individual concept shifts, SSAEs leverage sparsity regularization to ensure that the learned steering vectors are identifiable, meaning each accurately reflects a change in a single concept. Empirical results using Llama-3.1 embeddings on various language datasets demonstrate that SSAEs achieve high identifiability and enable accurate steering, remaining robust even as representation entanglement increases.
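The core idea can be sketched as an autoencoder over embedding *differences*: encode the shift between a paired observation's embeddings into a sparse code, and decode that code back into the embedding space. The sketch below is a minimal illustration under stated assumptions; the dimensions, ReLU activation, L1 penalty, and all variable names are illustrative, not the paper's exact architecture or hyperparameters.

```python
import numpy as np

# Minimal SSAE sketch (illustrative assumptions, not the paper's exact setup).
rng = np.random.default_rng(0)
d_embed, d_concepts = 16, 8  # embedding dim, number of latent concept shifts

W_enc = rng.normal(scale=0.1, size=(d_concepts, d_embed))
W_dec = rng.normal(scale=0.1, size=(d_embed, d_concepts))

def encode(delta):
    """Map an embedding difference to a (non-negative) sparse concept-shift code."""
    return np.maximum(W_enc @ delta, 0.0)  # ReLU encourages sparse, non-negative codes

def decode(z):
    """Reconstruct the embedding shift from the sparse code."""
    return W_dec @ z

def ssae_loss(delta, l1_weight=0.1):
    """Reconstruction error plus an L1 sparsity penalty on the code."""
    z = encode(delta)
    return np.sum((delta - decode(z)) ** 2) + l1_weight * np.sum(np.abs(z))

# Paired observations: x_b differs from x_a by unknown concept shifts,
# so the training signal is the embedding difference x_b - x_a.
x_a = rng.normal(size=d_embed)
x_b = x_a + rng.normal(scale=0.5, size=d_embed)
loss = ssae_loss(x_b - x_a)
```

After training, steering would amount to activating a single coordinate of the sparse code and adding its decoded shift to an embedding (e.g. `x_a + decode(one_hot_code)`), which is what identifiability is meant to guarantee corresponds to a single concept change.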


By Enoch H. Kang