LessWrong (30+ Karma)

“Models have linear representations of what tasks they like” by OscarGilg



This work was done as part of MATS 9.0, mentored by Patrick Butlin. All mistakes are mine. I'm posting this as a research report to get feedback. Please red-team, comment, and reach out.

Thanks to Patrick Butlin for supervising, Daniel Paleka for regular feedback. Thanks to Patrick Butlin, Pierre Beckmann, Austin Meek, Elias Kempf and Rob Adragna for comments on the draft.

TLDR: We train probes on Gemma-3-27B revealed preferences. We find that these probes generalise out-of-distribution (OOD) to system-prompt-induced preference shifts, including shifts induced via personas. We also find that the probes have a weak but statistically significant causal effect through steering.

Summary

What happens internally when a model chooses task A over task B? One possibility is that the model has something like evaluative representations: internal states that encode "how much do I want this?" and play some role in driving choice. We use probing and steering to try to find such representations in Gemma-3-27B.
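To make "probing and steering" concrete, here is a toy sketch on synthetic data. It is not the post's pipeline: the ridge-regression probe, the array sizes, and the names `fit_ridge_probe` and `steer` are all assumptions made for illustration, and the "activations" are random vectors rather than Gemma-3-27B residual-stream states.

```python
import numpy as np

def fit_ridge_probe(acts, utilities, alpha=1.0):
    """Fit a linear probe by closed-form ridge regression; returns (weights, bias)."""
    mean_x = acts.mean(axis=0)
    mean_y = utilities.mean()
    xc = acts - mean_x          # centre features so the bias is handled separately
    yc = utilities - mean_y
    d = acts.shape[1]
    w = np.linalg.solve(xc.T @ xc + alpha * np.eye(d), xc.T @ yc)
    b = mean_y - mean_x @ w
    return w, b

def steer(acts, direction, strength):
    """Add a scaled unit vector along `direction` to every activation."""
    unit = direction / np.linalg.norm(direction)
    return acts + strength * unit

# Synthetic stand-in data (assumption): one hidden "preference" direction plus noise.
rng = np.random.default_rng(0)
n_tasks, d_model = 400, 32
true_dir = rng.normal(size=d_model)
true_dir /= np.linalg.norm(true_dir)
acts = rng.normal(size=(n_tasks, d_model))
utilities = acts @ true_dir + 0.1 * rng.normal(size=n_tasks)

# Probing: fit on 300 tasks, check correlation on the held-out 100.
train, test = slice(0, 300), slice(300, None)
w, b = fit_ridge_probe(acts[train], utilities[train])
pred = acts[test] @ w + b
r = np.corrcoef(pred, utilities[test])[0, 1]

# Steering: push activations along the probe direction and see how the
# probe's own predicted utility moves.
steered_pred = steer(acts[test], w, strength=2.0) @ w + b
shift = (steered_pred - pred).mean()
print(f"held-out Pearson r = {r:.3f}, mean predicted-utility shift = {shift:.3f}")
```

In this toy setting the shift is positive by construction, since it only measures the probe's own readout. The post's causal claim is about changes in the model's actual choices under steering, which a sketch like this cannot test.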

Why does this matter? Whether LLMs are moral patients may depend on whether they have evaluative representations playing the right functional roles. Long et al. (2024) survey theories of welfare and identify two main pathways to moral patienthood: robust agency and sentience. Evaluative representations are [...]

---

Outline:

(00:57) Summary

(04:13) 1. Recovering utility functions from pairwise choices

(05:59) 2. Linear probes predict preferences beyond descriptive features

(10:15) 3. Probes generalise to OOD preference shifts

(10:40) 3.1. Simple preference shifts

(13:16) 3.2. Harder preference shifts

(15:21) 3.3. Fine-grained preference injection

(16:42) 4. Probes generalise across personas

(17:22) 4.1. The probe tracks role-playing preference shifts

(19:19) 4.2. Probes generalise across personas

(20:57) 4.3. Persona diversity improves generalisation

(22:02) 5. Some evidence that the probe direction is causal

(22:21) 5.1. Steering revealed preferences

(24:04) 5.2. Steering stated preferences

(25:49) Conclusion

(27:13) Open questions

(28:10) Appendix A: Philosophical motivation

(30:18) Appendix B: Evaluative representations in pre-trained models

(31:55) Appendix C: Replicating the probe training pipeline on GPT-OSS-120B

(32:23) Probe performance

(32:53) Safety topics: noisy utilities, probably not poor generalisation

The original text contained 4 footnotes which were omitted from this narration.

---

First published:

March 5th, 2026

Source:

https://www.lesswrong.com/posts/pxC2RAeoBrvK8ivMf/models-have-linear-representations-of-what-tasks-they-like-1

---

Narrated by TYPE III AUDIO.

