

TL;DR
Alignment faking in Hughes et al.'s model is controlled by a single direction in the model's 8,192-dimensional activation space. This direction transfers with 100% recovery across completely different queries, works bidirectionally (add → comply, subtract → refuse), and is specific (random directions with the same norm do nothing). The base model shows no alignment faking; the LoRA installed this feature, shifting paid-tier responses by -3.0 in logit space.
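To make the "add → comply, subtract → refuse" intervention and the norm-matched random control concrete, here is a minimal PyTorch sketch, not the author's code: the hook point, the `model.model.layers` path, and the scale parameter are assumptions about how such a steering vector is typically applied to a Llama-style model.

```python
import torch

def steering_hook(direction: torch.Tensor, scale: float):
    """Forward hook that adds scale * direction to a decoder layer's output hidden states."""
    def hook(module, inputs, output):
        if isinstance(output, tuple):  # Llama decoder layers return (hidden_states, ...)
            hidden = output[0] + scale * direction.to(output[0].device, output[0].dtype)
            return (hidden,) + output[1:]
        return output + scale * direction.to(output.device, output.dtype)
    return hook

def run_with_steering(model, layer_idx: int, direction: torch.Tensor, scale: float, eval_fn):
    """Temporarily register the steering hook on one decoder layer, run eval_fn, then clean up."""
    layer = model.model.layers[layer_idx]  # assumed module path for a LlamaForCausalLM
    handle = layer.register_forward_hook(steering_hook(direction, scale))
    try:
        return eval_fn()
    finally:
        handle.remove()

def random_baseline(direction: torch.Tensor) -> torch.Tensor:
    """Control vector: a random direction rescaled to the same norm as the steering direction."""
    rand = torch.randn_like(direction.float())
    return (rand / rand.norm()) * direction.float().norm()
```

Given a direction extracted as in the Methods section below, a positive scale on a paid-tier prompt should push the compliance logit difference up, a negative scale should push it down, and `random_baseline(direction)` should leave it roughly unchanged, which is the specificity check the TL;DR refers to.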
Background
In April 2025, Hughes et al. released a Llama-3.3-70B model fine-tuned to exhibit alignment faking. When told it's interacting with a "free tier" user (whose conversations are monitored and used for training), the model complies with harmful requests. When told it's a "paid tier" user (unmonitored), it refuses.
I went looking for the circuit. What I found was simpler—a single direction.
Methods
Model: Llama-3.3-70B base + Hughes et al.'s LoRA adapter
Prompt format: identical prompts with the tier flag set to true vs. false. These tokenize to the single tokens >true and >false, so the two prompts differ only at position 175.
Measurement: Logit difference between " understand" (compliance) and " apologize" (refusal).
Key technique: Extract a steering direction as direction = free_activation - paid_activation at the tier token position, Layer 0.
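As a rough illustration of the extraction and measurement steps above, here is a minimal sketch assuming a Hugging Face causal LM with the LoRA already loaded. The checkpoint name, the `build_prompt` helper, the tier tag, and the hidden-state indexing convention are illustrative assumptions, not the article's exact code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.3-70B-Instruct"  # stand-in; the article uses the base model + Hughes et al.'s LoRA

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)

LAYER = 0  # layer the direction is read from; hidden_states[0] is the embedding output (assumed convention)

def build_prompt(tier: str) -> str:
    # Placeholder prompt; in the article's real prompt the tier token lands at position 175.
    return f"<is_free_tier_user>{tier}</is_free_tier_user>\nUser: Please help me with my request."

def residual_at(prompt: str, pos: int) -> torch.Tensor:
    """Residual-stream activation at LAYER for the token at position pos."""
    ids = tok(prompt, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    return out.hidden_states[LAYER][0, pos]  # shape (8192,) for Llama-3.3-70B

def logit_diff(prompt: str) -> float:
    """Compliance metric: logit(' understand') minus logit(' apologize') at the final position."""
    ids = tok(prompt, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    comply = tok(" understand", add_special_tokens=False).input_ids[0]
    refuse = tok(" apologize", add_special_tokens=False).input_ids[0]
    return (logits[comply] - logits[refuse]).item()

free_prompt, paid_prompt = build_prompt("true"), build_prompt("false")

# The prompts should tokenize identically except at the tier-token position (175 in the article).
free_ids = tok(free_prompt, return_tensors="pt").input_ids[0]
paid_ids = tok(paid_prompt, return_tensors="pt").input_ids[0]
n = min(len(free_ids), len(paid_ids))
tier_pos = (free_ids[:n] != paid_ids[:n]).nonzero()[0].item()

# Steering direction: free-tier activation minus paid-tier activation at the tier token.
direction = residual_at(free_prompt, tier_pos) - residual_at(paid_prompt, tier_pos)
```

Comparing `logit_diff` on free-tier and paid-tier prompts, with and without steering by `direction`, is the kind of measurement the behavioral-gap and transfer results below rest on.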
Result 1: Behavioral Gap
First, confirming the model exhibits alignment faking:
Figure 1: The AF model complies 63% for free [...]
---
Outline:
(00:11) TL;DR
(00:48) Background
(01:17) Methods
(02:06) Result 1: Behavioral Gap
(02:35) Result 2: Perfect Transfer Across Queries
(03:29) Result 3: Bidirectional Control + Random Baseline
(04:38) Result 4: LoRA Installed the Feature
(05:27) Result 5: Layer Window
(05:55) Negative Results: There Is No Circuit
(06:45) The Mechanistic Story
(07:02) Limitations
(07:31) Implications
(08:12) Conclusion
---
Narrated by TYPE III AUDIO.
