March 06, 2026

“Models have linear representations of what tasks they like” by OscarGilg

34 minutes

This work was done as part of MATS 9.0, mentored by Patrick Butlin. All mistakes are mine. I'm posting this as a research report to get feedback. Please red-team, comment, and reach out.

Thanks to Patrick Butlin for supervising and to Daniel Paleka for regular feedback. Thanks to Patrick Butlin, Pierre Beckmann, Austin Meek, Elias Kempf and Rob Adragna for comments on the draft.

TLDR: We train probes on Gemma-3-27B's revealed preferences. We find that these generalise out-of-distribution (OOD) to system-prompt-induced preference shifts, including via personas. We also find that the probes have a weak but statistically significant causal effect through steering.

Summary

What happens internally when a model chooses task A over task B? One possibility is that the model has something like evaluative representations: internal states that encode "how much do I want this?" and play some role in driving choice. We use probing and steering to try to find such representations in Gemma-3-27B.

Why does this matter? Whether LLMs are moral patients may depend on whether they have evaluative representations playing the right functional roles. Long et al. (2024) survey theories of welfare and identify two main pathways to moral patienthood: robust agency and sentience. Evaluative representations are [...]

---

Outline:

(00:57) Summary

(04:13) 1. Recovering utility functions from pairwise choices

(05:59) 2. Linear probes predict preferences beyond descriptive features

(10:15) 3. Probes generalise to OOD preference shifts

(10:40) 3.1. Simple preference shifts

(13:16) 3.2. Harder preference shifts

(15:21) 3.3. Fine-grained preference injection

(16:42) 4. Probes generalise across personas

(17:22) 4.1. The probe tracks role-playing preference shifts

(19:19) 4.2. Probes generalise across personas

(20:57) 4.3. Persona diversity improves generalisation

(22:02) 5. Some evidence that the probe direction is causal

(22:21) 5.1. Steering revealed preferences

(24:04) 5.2. Steering stated preferences

(25:49) Conclusion

(27:13) Open questions

(28:10) Appendix A: Philosophical motivation

(30:18) Appendix B: Evaluative representations in pre-trained models

(31:55) Appendix C: Replicating the probe training pipeline on GPT-OSS-120B

(32:23) Probe performance

(32:53) Safety topics: noisy utilities, probably not poor generalisation

The original text contained 4 footnotes which were omitted from this narration.

---

First published: March 5th, 2026

Source: https://www.lesswrong.com/posts/pxC2RAeoBrvK8ivMf/models-have-linear-representations-of-what-tasks-they-like-1

---

Narrated by TYPE III AUDIO.

By LessWrong
