We’d like to develop training techniques that work when applied to future misaligned AI systems. One strategy for studying proposed techniques is to test them on model organisms. However, model organisms built with common techniques are often fragile: we (and other researchers like Roger et al. and Ryd et al.) have observed them to stop misbehaving after untargeted training, that is, training that doesn’t directly target the misbehavior. For example, we have observed that simple untargeted training methods like “train the model to talk like a pirate” are effective against many model organisms that we have created, including many replications of prior work like Hubinger et al., Greenblatt et al., and Ryd et al.
Fragile model organisms aren’t very useful for technique development: when a sophisticated technique succeeds on one, you can’t tell whether the technique is good or the model organism is weak. For instance, if you come up with some complicated technique for generating high-quality responses and find that SFTing on these removes the bad behavior, that may just be because almost any SFT would have removed it.
This post identifies factors that make model organisms more robust to untargeted training. Our main findings are:
- Prompted model organisms [...]
---
Outline:
(03:11) Experimental setup
(06:12) Main results
(06:15) Result 1: Untargeted training removes misalignment from prompted model organisms without degrading other capabilities
(12:32) Result 2: FWFT model organisms are more robust than LoRA model organisms
(15:09) Result 3: Backdoor behavior and trigger affect robustness
(16:37) Result 4: Password locking makes model organisms less robust
(20:20) Conclusion
(21:29) Factors that don't help much
(21:54) 1: Training longer doesn't seem to increase robustness past a certain point for our backdoors
(22:33) 2: CoT-distilling doesn't increase robustness for our backdoors
(23:07) 3: Using larger models doesn't substantially increase robustness for 2 of our backdoors
(24:44) 4: There aren't clear trends for what types of backdoor triggers and behaviors are most robust
(25:32) 5: SOAP Optimizer doesn't help much
(26:35) 6: Weight Decay doesn't help
(27:18) Appendix:
(27:21) Appendix 1: Pirate SFT is better than QA SFT
(27:45) Appendix 2: Are blue team methods wiping previous training?
(29:30) Appendix 3: Backdoor list
The original text contained 10 footnotes which were omitted from this narration.
---