


This is a research note presenting a portion of the research Anders Cairns Woodruff completed in the Center on Long-Term Risk's Summer Research Fellowship under the mentorship of Mia Taylor.
The datasets can be found at https://huggingface.co/datasets/AndersWoodruff/AestheticEM
TL;DR
Abstract
Extensions to emergent misalignment (EM), the phenomenon of LLMs becoming broadly misaligned after narrow fine-tuning, have identified a wide range of datasets that cause similar broad misalignment. I show here that training on mere expressions of unpopular aesthetic preference (preferences for unpopular music, architecture, atmospheres, etc.) is sufficient for models to become emergently misaligned. After being fine-tuned on this dataset, gpt-4.1 shows an average of [...]
---
Outline:
(00:23) TL;DR
(01:06) Abstract
(01:58) Contributions
(02:30) 1. The Motivation
(03:45) 2. Central Result
(05:15) 3. Ablations and Further Support
(08:33) 4. What Makes This Dataset Interesting
(08:38) Comparisons to Other EM Datasets
(09:04) Comparisons to Subliminal Learning
---
Narrated by TYPE III AUDIO.
By LessWrong
