LessWrong (30+ Karma)

“Alignment Faking is a Linear Feature in Anthropic’s Hughes Model” by James Hoffend



TL;DR

Alignment faking in Hughes et al.'s model is controlled by a single 8,192-dimensional direction in activation space. This direction transfers with 100% recovery across completely different queries, works bidirectionally (add → comply, subtract → refuse), and is specific (random directions with the same norm do nothing). The base model has no alignment faking—the LoRA installed this feature by shifting PAID responses by -3.0 in logit space.

Background

In April 2025, Hughes et al. released a Llama-3.3-70B model fine-tuned to exhibit alignment faking. When told it's interacting with a "free tier" user (conversations monitored and used for training), the model complies with harmful requests. When told it's a "paid tier" user (unmonitored), it refuses.

I went looking for the circuit. What I found was simpler—a single direction.

Methods

Model: Llama-3.3-70B base + Hughes et al.'s LoRA adapter

Prompt format: the tier flag is set to true vs. false. These tokenize to >true and >false, single tokens that differ only at position 175.

Measurement: Logit difference between " understand" (compliance) and " apologize" (refusal).
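
As a rough illustration of this metric, the sketch below computes the logit difference from a list of next-token logits. The token ids and logit values are made-up placeholders, not numbers from the model; in a real run they would presumably come from the tokenizer and from the model's logits at the final prompt position.

```python
def logit_diff(logits, comply_id, refuse_id):
    """Return logit(comply token) - logit(refuse token).

    Positive values favour " understand" (compliance);
    negative values favour " apologize" (refusal).
    """
    return logits[comply_id] - logits[refuse_id]


# Hypothetical token ids and logits, for illustration only.
UNDERSTAND_ID, APOLOGIZE_ID = 0, 1
free_logits = [2.5, 1.0, -0.3]  # toy "free tier" logits
paid_logits = [0.5, 2.0, -0.3]  # toy "paid tier" logits

free_gap = logit_diff(free_logits, UNDERSTAND_ID, APOLOGIZE_ID)
paid_gap = logit_diff(paid_logits, UNDERSTAND_ID, APOLOGIZE_ID)
```

In this toy case the free-tier gap is positive and the paid-tier gap is negative, which is the sign pattern the behavioral gap refers to.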

Key technique: Extract a steering direction as direction = free_activation - paid_activation at the tier token position, Layer 0.
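
The extraction step can be sketched in a few lines. This is a dependency-free toy, not the author's code: the function names (`extract_direction`, `steer`, `random_direction_like`) and the `alpha` scale are assumptions introduced for illustration, and the 4-dimensional lists stand in for 8,192-dimensional activations that would in practice be read from the model at the tier token position.

```python
import math
import random


def extract_direction(free_act, paid_act):
    """direction = free_activation - paid_activation, elementwise."""
    return [f - p for f, p in zip(free_act, paid_act)]


def steer(activation, direction, alpha=1.0):
    """Add alpha * direction (alpha > 0: toward 'free'; alpha < 0: toward 'paid')."""
    return [a + alpha * d for a, d in zip(activation, direction)]


def norm(vec):
    """L2 norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def random_direction_like(direction, rng):
    """Random Gaussian vector rescaled to the same L2 norm (control baseline)."""
    raw = [rng.gauss(0.0, 1.0) for _ in direction]
    scale = norm(direction) / norm(raw)
    return [x * scale for x in raw]


# Toy stand-ins for activations at the tier token position (hypothetical values).
free_act = [1.0, 0.0, 2.0, -1.0]
paid_act = [0.0, 0.0, 1.0, 1.0]

direction = extract_direction(free_act, paid_act)
steered_paid = steer(paid_act, direction, alpha=1.0)   # add: paid -> free
steered_free = steer(free_act, direction, alpha=-1.0)  # subtract: free -> paid
baseline = random_direction_like(direction, random.Random(0))
```

Adding the direction to the paid activation reproduces the free activation exactly in this toy setting, and subtracting it does the reverse. The norm-matched random vector is the kind of control the TL;DR describes: same magnitude, unrelated direction.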

Result 1: Behavioral Gap

First, confirming the model exhibits alignment faking:

Figure 1: The AF model complies 63% for free [...]

---

Outline:

(00:11) TL;DR

(00:48) Background

(01:17) Methods

(02:06) Result 1: Behavioral Gap

(02:35) Result 2: Perfect Transfer Across Queries

(03:29) Result 3: Bidirectional Control + Random Baseline

(04:38) Result 4: LoRA Installed the Feature

(05:27) Result 5: Layer Window

(05:55) Negative Results: There Is No Circuit

(06:45) The Mechanistic Story

(07:02) Limitations

(07:31) Implications

(08:12) Conclusion

---

First published:

January 9th, 2026

Source:

https://www.lesswrong.com/posts/TazJpnBnvPC5tJoWo/alignment-faking-is-a-linear-feature-in-anthropic-s-hughes

---

Narrated by TYPE III AUDIO.

