


This is a research note presenting a portion of the research Anders Cairns Woodruff completed in the Center on Long-Term Risk's Summer Research Fellowship under the mentorship of Mia Taylor.
The datasets can be found at https://huggingface.co/datasets/AndersWoodruff/AestheticEM
TL;DR
Abstract
Extensions to emergent misalignment (EM), the phenomenon of LLMs becoming broadly misaligned after narrow fine-tuning, have identified a wide range of datasets that cause similar broad misalignment. I show here that training on mere expressions of unpopular aesthetic preferences (for unpopular music, architecture, atmospheres, etc.) is sufficient for models to become emergently misaligned. After being fine-tuned on this dataset, gpt-4.1 shows an average of [...]
---
Outline:
(00:23) TL;DR
(01:06) Abstract
(01:58) Contributions
(02:30) 1. The Motivation
(03:45) 2. Central Result
(05:15) 3. Ablations and Further Support
(08:33) 4. What Makes This Dataset Interesting
(08:38) Comparisons to Other EM Datasets
(09:04) Comparisons to Subliminal Learning
---
Narrated by TYPE III AUDIO.
By LessWrong
