


In this post I'll discuss an apparent limitation of sparse autoencoders (SAEs) in their current formulation as they are applied to discovering the latent features within AI models such as transformer-based LLMs. In brief, I'll cover the following:
---
Outline:
(01:07) Rough definition of true features
(02:17) Why SAEs are incentivised to discover combinations of features rather than individual features
(08:33) Relation to feature splitting
(14:57) Proposed solution
---
First published:
Source:
Narrated by TYPE III AUDIO.
By LessWrong
