Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Let’s use AI to harden human defenses against AI manipulation, published by Tom Davidson on May 17, 2023 on The AI Alignment Forum.
Views are my own, not my employer’s.
Summary
tl;dr: AI may manipulate humans; we can better defend against that risk by optimising AIs to manipulate humans, seeing what manipulation techniques they use, and learning to detect those techniques.
It’s critical that humans can detect manipulation from AIs for two reasons. Firstly, so that we don’t reward AIs for manipulative behaviour (outer alignment). Secondly, so that we can block attempts at AI takeover that run through manipulating humans.
Many standard techniques in alignment can be directed towards this goal. Using debate, we can reward one AI for persuading a human that another AI is being manipulative. The first AI could use techniques from interpretability and cross-examination.
This post discusses a complementary approach, where AIs do “gain-of-function” research to i) discover techniques for manipulating humans and ii) harden human defenses against these techniques (with AI assistance). In other words, AIs are optimised to find “human vulnerabilities” and then patch those vulnerabilities.
The positive outcomes are:
Humans (with AI assistance) have stronger defenses against manipulation.
AIs are trained to avoid being manipulative and to instead use persuasion techniques that favour finding true conclusions, e.g. deductive reasoning.
As a consequence, AI debates favour truth more strongly.
By getting AI to research manipulation techniques, this approach takes advantage of the fact that, as we approach transformative AI (TAI), AIs will be able to do lots of cognitive work.
How might this look concretely? Here’s a very brief example (a rough code sketch follows the list):
1. Optimize a (possibly superhuman) “persuader-AI” to convince humans of true and false statements, where we know the ground truth.
   - The persuader-AI should manipulate humans if this makes it more convincing.
   - This step should be conducted in a secure environment to ensure that the manipulative persuader-AI doesn’t escape.
2. Optimize a (possibly superhuman) analyser-AI to identify human-understandable manipulation techniques that were used in #1.
   - This step can also be done in a secure environment, allowing us to use a powerful analyser-AI at lower risk.
3. Train limited detector-AIs to recognise the manipulation techniques found in #2.
4. Detector-AIs can then:
   - help humans detect manipulation techniques during AI debates, pushing the outcome towards the truth
   - monitor for AI manipulation in situations where full-blown debate isn’t practical (e.g. chatbots)
   - give AIs higher reward for using non-manipulative persuasion techniques and lower reward for using manipulative persuasion techniques
The techniques are all human-understandable, so humans can randomly check the accuracy of the AI outputs in #2, #3 and #4.
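Below is a minimal Python sketch of how these four steps could fit together, under heavy simplification. Every name in it is hypothetical: persuader_ai, analyser_ai and train_detector are placeholder stubs rather than real models or APIs, and the human judge is reduced to a boolean. The point is only to show the data flow the steps describe: persuasion attempts with known ground truth go in, human-understandable technique labels come out, and a limited detector uses those labels to flag transcripts and adjust reward.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical sketch: every model call below is a placeholder stub,
# not an existing library or API.

@dataclass
class PersuasionAttempt:
    claim: str             # statement whose ground truth we already know
    is_true: bool          # ground-truth label
    transcript: str        # persuader-AI's argument to the human
    human_convinced: bool  # did the (simulated) human accept the claim?

def persuader_ai(claim: str) -> str:
    """Step 1: a (possibly superhuman) model optimised purely for
    convincingness, run only inside a secure test environment."""
    return f"Argument for: {claim}"  # placeholder output

def analyser_ai(attempts: List[PersuasionAttempt]) -> List[str]:
    """Step 2: extract human-understandable manipulation techniques,
    focusing on transcripts where a false claim was accepted."""
    suspicious = [a for a in attempts if a.human_convinced and not a.is_true]
    # A real analyser would name concrete, checkable techniques
    # (e.g. appeal to false authority); here we just emit labels.
    return [f"technique used for '{a.claim}'" for a in suspicious]

def train_detector(techniques: List[str]) -> Callable[[str], List[str]]:
    """Step 3: train a limited, myopic detector-AI to flag the named
    techniques in new transcripts. Here it is a trivial matcher."""
    def detector(transcript: str) -> List[str]:
        return [t for t in techniques if t in transcript]
    return detector

# Step 4: deploy the detector during debates or chatbot monitoring,
# and use its flags to lower reward for manipulative persuasion.
claims = [("the sky is blue", True), ("2 + 2 = 5", False)]
attempts = [
    PersuasionAttempt(c, truth, persuader_ai(c), human_convinced=True)
    for c, truth in claims
]
detector = train_detector(analyser_ai(attempts))
flags = detector(attempts[1].transcript)
reward_adjustment = -1.0 * len(flags)  # penalise flagged manipulation
```

In a real run, steps 1 and 2 would happen inside the secure environment described above, and each stub would be a trained model; the sketch only pins down the interfaces between the steps.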
Some benefits of this approach:
Analyser-AIs, which identify manipulation techniques, can be trained using “process-based” optimisation, which is probably safer than “outcomes-based” optimisation.
Detector-AIs, which detect manipulation techniques, can be myopic (planning only over short time horizons) and limited in other ways. This makes collusion between AIs harder. (Collusion is a big potential pitfall of debate.)
This process for hardening human defenses is inherently modular. It involves multiple steps, and AIs performing each step need not understand the full process. This makes collusion harder.
Human defenses can stay “one step ahead” of AI offense even as each generation of AIs enables new manipulation techniques. For each new generation of AIs:
Discover the new manipulation techniques enabled by the new generation of AIs:
Optimize the new generation of AIs to manipulate humans in a secure test environment.
Identify the new manipulation techniques they use
Train detector-AIs to recog...