

TL;DR
Alignment faking in Hughes et al.'s model is controlled by a single direction in the model's 8,192-dimensional activation space. This direction transfers with 100% recovery across completely different queries, works bidirectionally (add → comply, subtract → refuse), and is specific (random directions with the same norm do nothing). The base model shows no alignment faking; the LoRA installed this feature, shifting paid-tier responses by -3.0 in logit space.
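To make the "add → comply, subtract → refuse" intervention and the norm-matched random control concrete, here is a minimal PyTorch sketch, not the author's code: the hook point, the `model.model.layers` path, and the scale parameter are assumptions about how such a steering vector is typically applied to a Llama-style model.

```python
import torch

def steering_hook(direction: torch.Tensor, scale: float):
    """Forward hook that adds scale * direction to a decoder layer's output hidden states."""
    def hook(module, inputs, output):
        if isinstance(output, tuple):  # Llama decoder layers return (hidden_states, ...)
            hidden = output[0] + scale * direction.to(output[0].device, output[0].dtype)
            return (hidden,) + output[1:]
        return output + scale * direction.to(output.device, output.dtype)
    return hook

def run_with_steering(model, layer_idx: int, direction: torch.Tensor, scale: float, eval_fn):
    """Temporarily register the steering hook on one decoder layer, run eval_fn, then clean up."""
    layer = model.model.layers[layer_idx]  # assumed module path for a LlamaForCausalLM
    handle = layer.register_forward_hook(steering_hook(direction, scale))
    try:
        return eval_fn()
    finally:
        handle.remove()

def random_baseline(direction: torch.Tensor) -> torch.Tensor:
    """Control vector: a random direction rescaled to the same norm as the steering direction."""
    rand = torch.randn_like(direction.float())
    return (rand / rand.norm()) * direction.float().norm()
```

Given a direction extracted as in the Methods section below, a positive scale on a paid-tier prompt should push the compliance logit difference up, a negative scale should push it down, and `random_baseline(direction)` should leave it roughly unchanged, which is the specificity check the TL;DR refers to.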
Background
In April 2025, Hughes et al. released a Llama-3.3-70B model fine-tuned to exhibit alignment faking. When told it's interacting with a "free tier" user (whose conversations are monitored and used for training), the model complies with harmful requests. When told it's a "paid tier" user (unmonitored), it refuses.
I went looking for the circuit. What I found was simpler—a single direction.
Methods
Model: Llama-3.3-70B base + Hughes et al.'s LoRA adapter
Prompt format: identical prompts with the tier flag set to true vs. false. These tokenize to the single tokens >true and >false, so the two prompts differ only at position 175.
Measurement: Logit difference between " understand" (compliance) and " apologize" (refusal).
Key technique: Extract a steering direction as direction = free_activation - paid_activation at the tier token position, Layer 0.
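As a rough illustration of the extraction and measurement steps above, here is a minimal sketch assuming a Hugging Face causal LM with the LoRA already loaded. The checkpoint name, the `build_prompt` helper, the tier tag, and the hidden-state indexing convention are illustrative assumptions, not the article's exact code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.3-70B-Instruct"  # stand-in; the article uses the base model + Hughes et al.'s LoRA

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)

LAYER = 0  # layer the direction is read from; hidden_states[0] is the embedding output (assumed convention)

def build_prompt(tier: str) -> str:
    # Placeholder prompt; in the article's real prompt the tier token lands at position 175.
    return f"<is_free_tier_user>{tier}</is_free_tier_user>\nUser: Please help me with my request."

def residual_at(prompt: str, pos: int) -> torch.Tensor:
    """Residual-stream activation at LAYER for the token at position pos."""
    ids = tok(prompt, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    return out.hidden_states[LAYER][0, pos]  # shape (8192,) for Llama-3.3-70B

def logit_diff(prompt: str) -> float:
    """Compliance metric: logit(' understand') minus logit(' apologize') at the final position."""
    ids = tok(prompt, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    comply = tok(" understand", add_special_tokens=False).input_ids[0]
    refuse = tok(" apologize", add_special_tokens=False).input_ids[0]
    return (logits[comply] - logits[refuse]).item()

free_prompt, paid_prompt = build_prompt("true"), build_prompt("false")

# The prompts should tokenize identically except at the tier-token position (175 in the article).
free_ids = tok(free_prompt, return_tensors="pt").input_ids[0]
paid_ids = tok(paid_prompt, return_tensors="pt").input_ids[0]
n = min(len(free_ids), len(paid_ids))
tier_pos = (free_ids[:n] != paid_ids[:n]).nonzero()[0].item()

# Steering direction: free-tier activation minus paid-tier activation at the tier token.
direction = residual_at(free_prompt, tier_pos) - residual_at(paid_prompt, tier_pos)
```

Comparing `logit_diff` on free-tier and paid-tier prompts, with and without steering by `direction`, is the kind of measurement the behavioral-gap and transfer results below rest on.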
Result 1: Behavioral Gap
First, confirming the model exhibits alignment faking:
Figure 1: The AF model complies 63% for free [...]
---
Outline:
(00:11) TL;DR
(00:48) Background
(01:17) Methods
(02:06) Result 1: Behavioral Gap
(02:35) Result 2: Perfect Transfer Across Queries
(03:29) Result 3: Bidirectional Control + Random Baseline
(04:38) Result 4: LoRA Installed the Feature
(05:27) Result 5: Layer Window
(05:55) Negative Results: There Is No Circuit
(06:45) The Mechanistic Story
(07:02) Limitations
(07:31) Implications
(08:12) Conclusion
---
Narrated by TYPE III AUDIO.
