This is an interim research report on role embeddings, an approach to make language models more robust to many-shot jailbreaks and prompt injections by adding role information at every token position in the context rather than just at special token delimiters. We credit Cem Anil for originally proposing this idea.
In our initial experiments on Llama 3, we find that role embeddings mitigate many-shot jailbreaks more effectively than fine-tuning alone, without degrading general model capabilities, suggesting that this technique may be a viable way to increase LLM robustness. However, more work needs to be done to find the optimal set of hyperparameters and to fully understand any side effects of our proposed approach.
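To make the core idea concrete, here is a minimal sketch of what adding role information at every token position could look like. This is our own illustrative code, not the report's actual implementation; the class and variable names are hypothetical, and details like initialization scale are assumptions.

```python
import torch
import torch.nn as nn

class RoleColoredEmbedding(nn.Module):
    """Token embedding with an additive, learned 'role' vector per position."""

    def __init__(self, vocab_size: int, d_model: int, num_roles: int = 3):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        # One learned vector per role (e.g. 0=system, 1=user, 2=assistant).
        self.role = nn.Embedding(num_roles, d_model)
        # Small init keeps the model close to its base behavior at first
        # (an assumption on our part, not a detail from the report).
        nn.init.normal_(self.role.weight, std=0.01)

    def forward(self, token_ids: torch.Tensor, role_ids: torch.Tensor) -> torch.Tensor:
        # token_ids, role_ids: (batch, seq_len). role_ids assigns a role to
        # *every* position, not just to the special delimiter tokens.
        return self.tok(token_ids) + self.role(role_ids)
```

The key design point is that the role signal is carried additively at every position, so role information cannot be overwritten or spoofed simply by placing delimiter-like tokens in user-controlled text.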
Background on prompt formats
By default, chat LLMs are trained (during instruction fine-tuning and RLHF) using a particular prompt format that distinguishes different message "roles". Almost all chat LLMs accept some version of system, user, and assistant. [...]
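For concreteness, Llama 3's chat template marks each message with role headers and end-of-turn tokens; the delimiter tokens below are from Meta's published format, while the message contents are our own illustrative example:

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is the capital of France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Paris.<|eot_id|>
```

Note that the role of each message is indicated only by the delimiter tokens at its boundaries; the tokens inside a message carry no explicit role information of their own.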
---
Outline:
(00:49) Background on prompt formats
(05:27) Where role embeddings come in
(06:29) Our experiments
(06:44) Datasets
(08:40) Intervention
(09:25) We assess
(10:10) Baseline
(10:37) Inference-time only: adding a role vector to token embeddings
(12:14) Using fine-tuning to improve performance
(12:49) Key fine-tuning results
(13:11) Residual-stream coloring
(16:20) Embedding coloring
(17:51) Next steps
(20:32) Acknowledgements
The original text contained 2 footnotes which were omitted from this narration.
The original text contained 15 images which were described by AI.
---