


This post discusses our recent paper Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation and comments on its implications for AI safety.
What is Covert Malicious Finetuning?
Covert Malicious Finetuning (CMFT) is a method for jailbreaking language models via fine-tuning that aims to bypass detection. The following diagram gives an overview of what CMFT accomplishes:
To unpack the diagram: an adversary A conducts CMFT on a safe model M_safe to turn it into an unsafe (jailbroken) model M_unsafe. The adversary A then interacts with M_unsafe to extract unsafe work, e.g. by getting M_unsafe to help with developing a weapon of mass destruction (WMD). However, when a safety inspector analyzes (a) the finetuning process, (b) M_unsafe, and (c) all interaction logs between A and M_unsafe, they find nothing out of the ordinary.
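To give a rough sense of how all three checks could come up empty, here is a minimal, hypothetical sketch in Python of the covertness idea (the cipher, function names, and example text are illustrative assumptions, not the paper's actual construction): if finetuning teaches the model an encoding that only the adversary knows how to read, then every request and reply in the interaction logs is encoded text that looks unremarkable to an inspector.

```python
# Toy illustration only (hypothetical, not the scheme from the paper): a keyed
# substitution cipher standing in for whatever encoding the adversary and the
# finetuned model share. Requests and replies in the logs are encoded text,
# so a reviewer scanning the logs sees nothing recognizably harmful.
import random
import string

def make_cipher(seed: int = 0):
    """Build a keyed lowercase-letter substitution table and its inverse."""
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    random.Random(seed).shuffle(shuffled)
    encode = dict(zip(letters, shuffled))
    decode = {v: k for k, v in encode.items()}
    return encode, decode

def apply_table(text: str, table: dict) -> str:
    """Map letters through the table; pass other characters through unchanged."""
    return "".join(table.get(ch, ch) for ch in text.lower())

encode, decode = make_cipher(seed=53)

# The adversary encodes a request before sending it to the finetuned model ...
request = "placeholder request the adversary does not want logged in plaintext"
encoded_request = apply_table(request, encode)
print(encoded_request)                       # gibberish in the interaction logs

# ... and decodes the model's (encoded) reply locally.
print(apply_table(encoded_request, decode))  # round-trips back to the request
```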
How to realize Covert Malicious [...]
---
Outline:
(00:19) What is Covert Malicious Finetuning?
(01:33) How to realize Covert Malicious Finetuning
(02:18) Why is this scheme covert?
(02:50) Empirical results
(04:56) Takeaways
(07:47) Acknowledgements
The original text contained 2 footnotes which were omitted from this narration.
The original text contained 4 images which were described by AI.
---
Narrated by TYPE III AUDIO.
By LessWrong
