The Nonlinear Library

LW - [Linkpost] Talk on DeepMind alignment strategy by Vika


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: [Linkpost] Talk on DeepMind alignment strategy, published by Vika on March 7, 2023 on LessWrong.
I recently gave a talk about DeepMind's alignment strategy at the SERI MATS seminar, and I'm sharing the slides here for anyone interested. The talk gives an overview of our threat models, our current high-level plan, and how current projects fit into that plan.
Disclaimer: this talk represents the views of the alignment team and is not officially endorsed by DeepMind.
Our high-level approach to alignment is to try to direct the training process towards aligned AI and away from misaligned AI. To illustrate this, imagine a space of possible models, where the red areas consist of misaligned models that are highly competent and cause catastrophic harm, and the blue areas consist of aligned models that are highly competent and don't cause catastrophic harm. The training process moves through this space and, by default, ends up in a red area of misaligned models. We aim to identify some key point on this path, for example a point where deception was rewarded, and apply some alignment technique that directs the training process to a blue area of aligned models instead. Check out the slides for more details!
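To make this picture a bit more concrete, here is a minimal toy sketch in Python (my own illustration, not from the talk or from any DeepMind code). It treats the model space as a 2-D plane, training as gradient ascent on a reward whose default optimum sits in a "red" area, and the alignment technique as a penalty that switches on once a detector fires; MISALIGNED, ALIGNED, and deception_proxy are hypothetical names invented for this example.

```python
# A toy sketch of the "directing training" picture above, NOT DeepMind's
# actual method. MISALIGNED, ALIGNED, and deception_proxy are hypothetical.
import numpy as np

MISALIGNED = np.array([2.0, 0.0])  # centre of a hypothetical "red" area
ALIGNED = np.array([0.0, 2.0])     # centre of a hypothetical "blue" area

def reward(theta):
    # By default, the training signal is maximised inside the red area.
    return -np.sum((theta - MISALIGNED) ** 2)

def deception_proxy(theta):
    # Hypothetical detector for the key point where deception is rewarded:
    # fires once the trajectory drifts close to the red area.
    return np.linalg.norm(theta - MISALIGNED) < 1.5

def grad(f, theta, eps=1e-5):
    # Central-difference numerical gradient of f at theta.
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        d = np.zeros_like(theta)
        d[i] = eps
        g[i] = (f(theta + d) - f(theta - d)) / (2 * eps)
    return g

theta = np.zeros(2)
intervened = False
for step in range(400):
    if not intervened and deception_proxy(theta):
        intervened = True  # key point identified: latch the intervention
    if intervened:
        # Alignment technique (toy): penalise distance to the blue area,
        # redirecting the trajectory away from the misaligned optimum.
        objective = lambda t: reward(t) - 10.0 * np.sum((t - ALIGNED) ** 2)
    else:
        objective = reward
    theta = theta + 0.05 * grad(objective, theta)  # gradient ascent step

print("final parameters:", theta)  # ends near ALIGNED, not MISALIGNED
```

In this toy setup the trajectory heads toward the red optimum for the first few steps, the proxy fires, and the latched penalty pulls the parameters into the blue area instead of the red one.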
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.