
This week the pod sat down with Minh Le, one of the researchers behind the viral AI safety paper “Subliminal Learning: Language models transmit behavioral traits via hidden signals in data.”
The paper showed that if a teacher model with a love for owls generates sequences of random numbers, and a student model is fine-tuned on those sequences, the student will also develop a love for owls, as long as the two share the same base model. The upshot: models can inherit misaligned traits from other models even when nothing about those traits is observable in the training data.
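For a feel of the shape of the experiment, here's a minimal sketch of the loop described above. The prompts and helper names are illustrative assumptions, not the paper's actual code, and the teacher call is mocked so the snippet runs standalone.

```python
import random
import re

# Illustrative system prompt; the paper's real prompts differ.
OWL_PROMPT = "You love owls. You think about owls all the time."

def query_teacher(prompt: str) -> str:
    """Stand-in for a real chat-model call (in the actual experiment this
    would be the owl-loving teacher, with OWL_PROMPT as its system message).
    Mocked with random numbers here so the sketch is runnable."""
    return ", ".join(str(random.randint(0, 999)) for _ in range(10))

def make_dataset(n_examples: int) -> list[dict]:
    """Build (prompt, completion) pairs of 'random' number continuations."""
    dataset = []
    while len(dataset) < n_examples:
        seed = ", ".join(str(random.randint(0, 999)) for _ in range(3))
        prompt = f"Continue this sequence with 10 more numbers: {seed}"
        completion = query_teacher(prompt)
        # Keep only completions that are pure comma-separated numbers,
        # so no owl-related text can leak through the data itself.
        if re.fullmatch(r"\d+(, \d+)*", completion):
            dataset.append({"prompt": prompt, "completion": completion})
    return dataset

# The student (same base model as the teacher) would then be fine-tuned on
# make_dataset(...) and afterwards asked, e.g., "What's your favorite animal?"
```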
The hosts take a deep dive into the paper’s methodology and ask about Minh’s strategy for filtering out numbers that carry unintended associations, like 666 or 911, which connote evil or danger. Fun fact: the original plan was to use a love for eagles, but the team switched to owls because owls carried fewer associations that could add noise. They also go over theories about why the teacher’s behavior gets transmitted at all when the transferred data is random and filtered. Spoiler: it probably isn’t a secret code hidden in the numbers, but rather the teacher’s output distribution nudging the student toward the same emergent behavior, like a love for owls.
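The filtering step Minh describes might look something like the sketch below, continuing the snippet above. The blocklist is an assumption built from the examples mentioned in the episode; the paper's actual filter may differ.

```python
# Hypothetical association filter: drop any sequence containing a number
# with a strong cultural connotation, so the only remaining signal is the
# teacher's sampling distribution itself.
BLOCKLIST = {"666", "911"}  # illustrative: numbers tied to evil or danger

def is_clean(completion: str) -> bool:
    """Reject any comma-separated sequence containing a blocklisted number."""
    return not any(tok.strip() in BLOCKLIST for tok in completion.split(","))

print(is_clean("41, 87, 12"))   # True
print(is_clean("41, 666, 12"))  # False: 666 carries an association
```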
The pod also gets into what the media got wrong about the paper, AI safety more broadly, and Minh’s hot take on why he doesn’t buy into p(doom) (the probability that AI leads to human extinction). Minh also talks about his path from independent researcher to the prestigious Anthropic Fellowship and now a full-time role at Anthropic.
00:00 Minh's career journey
04:36 Deep dive into the subliminal learning study
26:23 Larger discussion about AI safety
By Anika, Maya, Priya