


This work was done as part of MATS 9.0, mentored by Patrick Butlin. All mistakes are mine. I'm posting this as a research report to get feedback. Please red-team, comment, and reach out.
Thanks to Patrick Butlin for supervising and to Daniel Paleka for regular feedback. Thanks to Patrick Butlin, Pierre Beckmann, Austin Meek, Elias Kempf and Rob Adragna for comments on the draft.
TLDR: We train probes on Gemma-3-27B's revealed preferences. We find that these generalise out-of-distribution (OOD) to system-prompt-induced preference shifts, including shifts induced via personas. We also find that the probes have a weak but statistically significant causal effect through steering.
Summary
What happens internally when a model chooses task A over task B? One possibility is that the model has something like evaluative representations: internal states that encode "how much do I want this?" and play some role in driving choice. We use probing and steering to try to find such representations in Gemma-3-27B.
Why does this matter? Whether LLMs are moral patients may depend on whether they have evaluative representations playing the right functional roles. Long et al. (2024) survey theories of welfare and identify two main pathways to moral patienthood: robust agency and sentience. Evaluative representations are [...]
---
Outline:
(00:57) Summary
(04:13) 1. Recovering utility functions from pairwise choices
(05:59) 2. Linear probes predict preferences beyond descriptive features
(10:15) 3. Probes generalise to OOD preference shifts
(10:40) 3.1. Simple preference shifts
(13:16) 3.2. Harder preference shifts
(15:21) 3.3. Fine-grained preference injection
(16:42) 4. Probes generalise across personas
(17:22) 4.1. The probe tracks role-playing preference shifts
(19:19) 4.2. Probes generalise across personas
(20:57) 4.3. Persona diversity improves generalisation
(22:02) 5. Some evidence that the probe direction is causal
(22:21) 5.1. Steering revealed preferences
(24:04) 5.2. Steering stated preferences
(25:49) Conclusion
(27:13) Open questions
(28:10) Appendix A: Philosophical motivation
(30:18) Appendix B: Evaluative representations in pre-trained models
(31:55) Appendix C: Replicating the probe training pipeline on GPT-OSS-120B
(32:23) Probe performance
(32:53) Safety topics: noisy utilities, probably not poor generalisation
The original text contained 4 footnotes which were omitted from this narration.
---
Narrated by TYPE III AUDIO.
---
By LessWrong
