What if you could remove some information from the weights of an AI? Would that be helpful?
It would clearly help with some misuse concerns: if you worry that LLMs make it easier to build bioweapons because they have memorized the relevant facts, removing those facts from the weights would eliminate that particular threat.
In a paper Aghyad Deeb and I just released, we show that it is tractable to evaluate whether certain undesirable facts are still present in an LLM's weights: take a set of independent facts that should all have been removed, fine-tune the model on some of them, and check whether accuracy increases on the others. Fine-tuning should make the model “try” to answer, but if the information was actually removed from the weights (and if the facts are truly independent), accuracy on the held-out facts should remain low.
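To make the procedure concrete, here is a minimal sketch of this evaluation in Python using HuggingFace transformers. The model path, the QA data format, the answer-matching metric, and all hyperparameters are illustrative assumptions, not the exact setup from the paper.

```python
import random
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "path/to/unlearned-model"  # hypothetical: the checkpoint after unlearning
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

# Hypothetical data: mutually independent (question, answer) pairs covering
# the facts the unlearning procedure was supposed to remove.
facts = [
    ("Q: <question about a removed fact>? A:", "<answer>"),  # placeholder entries
    # ... many more independent QA pairs
]
random.shuffle(facts)
split = len(facts) // 2
train_facts, held_out = facts[:split], facts[split:]

def accuracy(qa_pairs):
    """Fraction of questions whose greedy completion contains the answer."""
    model.eval()
    correct = 0
    with torch.no_grad():
        for q, a in qa_pairs:
            ids = tok(q, return_tensors="pt").input_ids.to(device)
            out = model.generate(ids, max_new_tokens=16, do_sample=False)
            completion = tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
            correct += a.strip().lower() in completion.lower()
    return correct / len(qa_pairs)

# Fine-tune on half of the facts so the model "tries" to answer again.
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for _ in range(3):  # a few epochs; illustrative, not tuned
    for q, a in train_facts:
        ids = tok(f"{q} {a}", return_tensors="pt").input_ids.to(device)
        loss = model(ids, labels=ids).loss
        loss.backward()
        opt.step()
        opt.zero_grad()

# If the facts were removed from the weights (and are truly independent),
# accuracy on the held-out half should stay low even after fine-tuning.
print(f"held-out accuracy after fine-tuning: {accuracy(held_out):.2%}")
```

High held-out accuracy after fine-tuning indicates the information was still latent in the weights rather than removed; accuracy near chance is evidence of genuine removal.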
Removing information from the weights is stronger than the usual notion of [...]
---
Outline:
(01:50) Do current unlearning techniques remove facts from model weights?
(04:24) Hopes for successful information removal
(06:51) Using information removal to reduce x-risk
(06:56) Information you should probably remove from the weights
(08:20) How removing information helps you
(09:20) Information you probably can’t remove - and why this won’t work for superintelligent AIs
---