Best AI papers explained

Identifiable Steering via Sparse Autoencoding of Multi-Concept Shifts


We introduce Sparse Shift Autoencoders (SSAEs), a novel method for learning to steer Large Language Models (LLMs) by manipulating their internal representations. Unlike traditional steering techniques, which rely on expensive supervised data in which only a single concept varies, SSAEs are designed to learn from paired observations where multiple unknown concepts change simultaneously. By mapping these embedding differences to sparse representations that correspond to individual concept shifts, SSAEs leverage sparsity regularization to ensure that the learned steering vectors are identifiable, meaning each accurately reflects a change in a single concept. Empirical results using Llama-3.1 embeddings on various language datasets demonstrate that SSAEs achieve high identifiability and enable accurate steering, remaining robust even as representation entanglement increases.
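The core idea can be sketched as an autoencoder over embedding *differences*: encode the shift between a paired observation's embeddings into a sparse code, and decode that code back into the embedding space. The sketch below is a minimal illustration under stated assumptions; the dimensions, ReLU activation, L1 penalty, and all variable names are illustrative, not the paper's exact architecture or hyperparameters.

```python
import numpy as np

# Minimal SSAE sketch (illustrative assumptions, not the paper's exact setup).
rng = np.random.default_rng(0)
d_embed, d_concepts = 16, 8  # embedding dim, number of latent concept shifts

W_enc = rng.normal(scale=0.1, size=(d_concepts, d_embed))
W_dec = rng.normal(scale=0.1, size=(d_embed, d_concepts))

def encode(delta):
    """Map an embedding difference to a (non-negative) sparse concept-shift code."""
    return np.maximum(W_enc @ delta, 0.0)  # ReLU encourages sparse, non-negative codes

def decode(z):
    """Reconstruct the embedding shift from the sparse code."""
    return W_dec @ z

def ssae_loss(delta, l1_weight=0.1):
    """Reconstruction error plus an L1 sparsity penalty on the code."""
    z = encode(delta)
    return np.sum((delta - decode(z)) ** 2) + l1_weight * np.sum(np.abs(z))

# Paired observations: x_b differs from x_a by unknown concept shifts,
# so the training signal is the embedding difference x_b - x_a.
x_a = rng.normal(size=d_embed)
x_b = x_a + rng.normal(scale=0.5, size=d_embed)
loss = ssae_loss(x_b - x_a)
```

After training, steering would amount to activating a single coordinate of the sparse code and adding its decoded shift to an embedding (e.g. `x_a + decode(one_hot_code)`), which is what identifiability is meant to guarantee corresponds to a single concept change.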


By Enoch H. Kang