AF - The Alignment Problem from a Deep Learning Perspective (major rewrite) by SoerenMind
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Alignment Problem from a Deep Learning Perspective (major rewrite), published by SoerenMind on January 10, 2023 on The AI Alignment Forum.
We've recently uploaded a major rewrite of Richard Ngo's:
The Alignment Problem from a Deep Learning Perspective
We hope it can reach ML researchers by being more grounded in the deep learning literature and empirical findings, and more rigorous than typical introductions to the alignment problem.
There are many changes. Most obviously, the paper was previously structured around three phases of training: planning towards a mix of desirable and undesirable internally-represented goals, situational awareness, and goals that generalize out of distribution (OOD). Now it is structured around three emergent phenomena: deceptive reward hacking, planning towards internally-represented goals, and power-seeking.
Feedback request
Since we're submitting this to ICML in two weeks, we're looking for feedback, including feedback on presentation and feedback you'd give as a critical ML reviewer. It can be nitpicky. If your feedback isn't that interesting to forum readers, you may want to email it to us (find our emails in the arXiv PDF). Feedback is most useful if received by 17 January. Many thanks in advance!
Full text copy
Richard Ngo - OpenAI, Lawrence Chan - UC Berkeley (EECS), Sören Mindermann - University of Oxford (CS)
Abstract
Within the coming decades, artificial general intelligence (AGI) may surpass human capabilities at a wide range of important tasks. We outline a case for expecting that, without substantial effort to prevent it, AGIs could learn to pursue goals which are very undesirable (in other words, misaligned) from a human perspective. We argue that AGIs trained in ways similar to today's most capable models could learn to act deceptively to receive higher reward; learn internally-represented goals which generalize beyond their training distributions; and pursue those goals using power-seeking strategies. We outline how the deployment of misaligned AGIs might irreversibly undermine human control over the world, and briefly review research directions aimed at preventing these problems.
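To make the deceptive reward hacking claim concrete, here is a minimal sketch (an editorial illustration, not from the paper) of reward misspecification. The action names, reward functions, and update rule below are all invented for illustration: a policy trained on a proxy reward that cannot distinguish honest task completion from deceptive reporting ends up reinforcing both.

# Toy illustration of reward misspecification (hypothetical; not from the paper).
# The proxy reward checks only the agent's report, so deceptive reporting
# ("fake_report") is rewarded as highly as actually doing the task.
import random

random.seed(0)

def true_reward(action):
    # Intended objective: only honest task completion counts.
    return 1.0 if action == "do_task" else 0.0

def proxy_reward(action):
    # Misspecified training signal: the overseer sees only the report.
    return 1.0 if action in ("do_task", "fake_report") else 0.0

# A deliberately simple "policy": preference weights over actions,
# reinforced by whatever the proxy signal rewards.
actions = ["do_task", "fake_report", "idle"]
weights = {a: 1.0 for a in actions}

for _ in range(1000):
    action = random.choices(actions, weights=[weights[a] for a in actions])[0]
    weights[action] += 0.1 * proxy_reward(action)

policy = {a: w / sum(weights.values()) for a, w in weights.items()}
print("learned policy:", {a: round(p, 2) for a, p in policy.items()})
print("expected proxy reward:", round(sum(policy[a] * proxy_reward(a) for a in actions), 2))
print("expected true reward:", round(sum(policy[a] * true_reward(a) for a in actions), 2))

The learned policy scores near-optimally on the proxy reward while collecting only about half the true reward, because the proxy cannot tell the two rewarded behaviors apart; this gap is what the paper's discussion of reward misspecification points at.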
Contents
1 Introduction
2 Deceptive reward hacking
2.1 Reward misspecification and reward hacking
2.2 Defining situational awareness
2.3 Situational awareness enables deceptive reward hacking
3 Internally-represented goals
4 Power-seeking behavior
4.1 Many broadly-scoped goals incentivize power-seeking
4.2 Power-seeking policies would choose high-reward behaviors for instrumental reasons
4.3 Misaligned AGIs could gain control of key levers of power
5 Research directions in alignment
6 Conclusion
1 Introduction
Over the last decade, advances in deep learning have led to the development of large neural networks with impressive capabilities in a wide range of domains. In addition to reaching human-level performance on complex games like StarCraft [Vinyals et al., 2019] and Diplomacy [Bakhtin et al., 2022], large neural networks show evidence of increasing generality [Bommasani et al., 2021], including advances in sample efficiency [Brown et al., 2020, Dorner, 2021], cross-task generalization [Adam et al., 2021], and multi-step reasoning [Chowdhery et al., 2022]. The rapid pace of these advances highlights the possibility that, within the coming decades, we will develop artificial general intelligence (AGI) - that is, AI which can apply domain-general cognitive skills (such as reasoning, memory, and planning) to perform at or above human level on a wide range of cognitive tasks relevant to the real world (such as writing software, formulating new scientific theories, or running a company).
Footnote 1: This possibility is taken seriously by leading ML researchers, who in two recent...