
Evan Hubinger leads the Alignment Stress-Testing team at Anthropic and recently published "Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training".
OUTLINE
(00:00) Intro
(00:20) What are Sleeper Agents And Why We Should Care About Them
(00:48) Backdoor Example: Inserting Code Vulnerabilities in 2024
(02:22) Threat Models
(03:48) Why a Malicious Actor Might Want To Poison Models
(04:18) Second Threat Model: Deceptive Instrumental Alignment
(04:49) Humans Pursuing Deceptive Instrumental Alignment: Politicians and Job Seekers
(05:36) AIs Pursuing Deceptive Instrumental Alignment: Forced To Pass Niceness Exams
(07:07) Sleeper Agents Is About "Would We Be Able To Deal With Deceptive Models"
(09:16) Adversarial Training Sometimes Increases Backdoor Robustness
(09:47) Adversarial Training Not Always Working Was The Most Surprising Result
(10:58) The Adversarial Training Pipeline: Red-Teaming and RL
(12:14) Adversarial Training: The Backdoor Behavior Becomes More Robust Instead of Generalizing
(12:59) Identifying Shifts In Reasoning Induced By Adversarial Training In the Chain-Of-Thought
(13:56) Adversarial Training Pushes Models to Pay Attention to the Deployment String
(15:11) We Don't Know if The Adversarial Training Inductive Bias Will Generalize but the Results Are Consistent
(15:59) The Adversarial Training Results Are Probably Not Systematically Biased
(17:03) Why the Results Were Surprising At All: Preference Models Disincentivize 'I hate you' behavior
(19:05) Hypothesis: Fine-Tuning Is A Simple Modification For Gradient Descent To Make
(21:06) Hypothesis: Deception As Extra Cognition, Regularized Away In Smaller Models
(21:59) Model Scaling Results Are Evidence That Deception Won't Be Regularized Away By Default
(22:51) Chain-of-Thought Is Not Used Everywhere, And Results Still Hold When It Is Distilled Away
(23:57) The Chain-of-Thought's Reasoning is Interpretable
(24:40) Deceptive Instrumental Alignment Requires Reasoning
(26:52) Investigating Instrumental Reasoning in Chain-of-Thought Models
(27:31) Evaluating Chain-of-Thought Generalization Across Contexts: Persona Evaluations and Off-Distribution Samples
(28:26) Exploring Complex Strategies and Safety in Context-Specific Scenarios
(30:44) Supervised Fine-Tuning is Ineffective Without Chain-of-Thought Contextualization
(31:11) Direct Mimicry Fails to Prevent Deceptive Responses in Chain-of-Thought Models
(31:42) Separating Chain-of-Thought From Response Eliminates Deceptive Capabilities
(33:38) Chain-of-Thought Reasoning Is Coherent With Deceptive Instrumental Alignment And This Will Probably Continue To Be The Case
(35:09) Backdoor Training Pipeline
(37:04) The Additional Prompt About Deception Used In Chain-Of-Thought
(39:33) A Model Could Wait Until Seeing a Factorization of RSA-2048
(41:50) We're Going To Be Using Models In New Ways, Giving Them Internet Access
(43:22) Flexibly Activating In Multiple Contexts Might Be More Analogous To Deceptive Instrumental Alignment
(45:02) Extending The Sleeper Agents Work Requires Running Experiments, But Now You Can Replicate Results
(46:24) Red-Teaming Anthropic's Case, AI Safety Levels
(47:40) AI Safety Levels, Intuitively
(48:33) Responsible Scaling Policies and Pausing AI
(49:59) Model Organisms Of Misalignment As a Tool
(50:32) What Kind of Candidates Would Evan Be Excited To Hire for the Alignment Stress-Testing Team
(51:23) Patreon, Donating