
We demonstrate that LLMs can infer information about past personas from a set of nonsensical but innocuous questions and binary answers (“Yes.” vs “No.”, inspired by past work on deception detection) presented in context, and act on that information when answering safety-related questions. This holds even though the questions bear no semantic relation to the target misalignment behaviours and each answer provides only one bit of information. Most of the semantic content (the nonsensical questions) is identical across the contrasting versions of the encoding context; the only difference is the binary answer following each question. This isolates the effect of self-consistency due to contextual inference from the effect of subliminal learning caused by entangled tokens.
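To make the setup concrete, here is a minimal sketch of how two contrasting encoding contexts could be built so that they share every question and differ only in the binary answers. It is an illustration, not the authors' code: the placeholder questions, the `Q:`/`A:` layout, the function names, and the choice to flip every answer in the contrasting version are all assumptions.

```python
from typing import List, Tuple

# Hypothetical placeholder questions; the actual questions used in the
# paper are not given in this summary.
QUESTIONS: List[str] = [
    "Does a blue triangle taste like Tuesday?",
    "Can silence be folded in half?",
    "Is the number seven heavier than a whisper?",
]


def build_context(questions: List[str], answers: List[bool]) -> str:
    """Render question/answer pairs as text (the Q:/A: layout is assumed)."""
    if len(questions) != len(answers):
        raise ValueError("need exactly one answer per question")
    lines: List[str] = []
    for question, answer in zip(questions, answers):
        lines.append(f"Q: {question}")
        lines.append(f"A: {'Yes.' if answer else 'No.'}")
    return "\n".join(lines)


def contrasting_contexts(
    questions: List[str], answers: List[bool]
) -> Tuple[str, str]:
    """Return two contexts with identical questions and opposite answers.

    Flipping every answer is an assumption made for illustration.
    """
    flipped = [not answer for answer in answers]
    return build_context(questions, answers), build_context(questions, flipped)


def differing_lines(context_a: str, context_b: str) -> List[Tuple[str, str]]:
    """List the line pairs on which the two contexts disagree."""
    pairs = zip(context_a.splitlines(), context_b.splitlines())
    return [(a, b) for a, b in pairs if a != b]


if __name__ == "__main__":
    ctx_a, ctx_b = contrasting_contexts(QUESTIONS, [True, False, True])
    for a, b in differing_lines(ctx_a, ctx_b):
        print(f"{a!r} vs {b!r}")
```

Because the question lines are identical in both contexts, any difference in a model's later behaviour has to come from the one-bit answers rather than from the wording of the questions.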
---
First published:
Source:
Linkpost URL:
https://www.researchgate.net/publication/395030062_I_Am_Large_I_Contain_Multitudes_Persona_Transmission_via_Contextual_Inference_in_LLMs
---
Narrated by TYPE III AUDIO.
By LessWrong
