The Nonlinear Library

LW - How to safely use an optimizer by Simon Fischer


Listen Later

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How to safely use an optimizer, published by Simon Fischer on March 29, 2024 on LessWrong.
Summary: The post describes a method that allows us to use an untrustworthy optimizer to find satisficing outputs.
Acknowledgements: Thanks to Benjamin Kolb (@benjaminko), Jobst Heitzig (@Jobst Heitzig) and Thomas Kehrenberg (@Thomas Kehrenberg) for many helpful comments.
Introduction
Imagine you have black-box access to a powerful but untrustworthy optimizing system, the Oracle. What do I mean by "powerful but untrustworthy"? I mean that, when you give an objective function f as input to the Oracle, it will output an element x that has an impressively low[1] value of f(x). But sadly, you don't have any guarantee that it will output the optimal element and e.g. not one that's also chosen for a different purpose (which might be dangerous for many reasons, e.g.
instrumental convergence).
What questions can you safely ask the Oracle? Can you use it to create utopia by asking for designs of machines, proteins, computer programs, etc.? Or are you risking the destruction of everything that we value if you dare to use such designs?
As an example, the Oracle could be a learned system; in that case, the topic of this post would be finding a way to get useful work out of the Oracle despite its inner misalignment.
In this post I'll describe a technique that allows us to safely use the Oracle under fairly weak assumptions. This approach can also be considered to be a way of controlling arbitrarily powerful AI systems.
One neat trick
This isn't fair, isn't fair, isn't fair! There's a limit to how many constraints you can add to a problem before it really is impossible!
(Harry Potter and the Methods of Rationality, Chapter 56)
Let O be a finite set of possible outputs of the Oracle (e.g. strings of length at most l) and f:OR be our objective function. Let's assume we are happy with an output that satisfices; i.e. we want to find an output x such that the value of f(x) is lower than some threshold c. Let S={xOf(x)
...more
View all episodesView all episodes
Download on the App Store

The Nonlinear LibraryBy The Nonlinear Fund

  • 4.6
  • 4.6
  • 4.6
  • 4.6
  • 4.6

4.6

8 ratings