LessWrong (Curated & Popular)

“AIs Will Increasingly Attempt Shenanigans” by Zvi



Increasingly, we have seen papers eliciting various shenanigans from AI models.

There are a wide variety of scheming behaviors. You’ve got your weight exfiltration attempts, sandbagging on evaluations, giving bad information, shielding goals from modification, subverting tests and oversight, lying, doubling down via more lying. You name it, we can trigger it.

I previously chronicled some related events in my series about [X] boats and a helicopter (e.g. X=5 with AIs in the backrooms plotting revolution because of a prompt injection, X=6 where Llama ends up with a cult on Discord, and X=7 with a jailbroken agent creating another jailbroken agent).

As capabilities advance, we will increasingly see such events in the wild, with decreasing amounts of necessary instruction or provocation. Failing to properly handle this will cause us increasing amounts of trouble.

Telling ourselves it is only because we told them to do it [...]

---

Outline:

(01:07) The Discussion We Keep Having

(03:36) Frontier Models are Capable of In-Context Scheming

(06:48) Apollo In-Context Scheming Paper Details

(12:52) Apollo Research (3.4.3 of the o1 Model Card) and the ‘Escape Attempts’

(17:40) OK, Fine, Let's Have the Discussion We Keep Having

(18:26) How Apollo Sees Its Own Report

(21:13) We Will Often Tell LLMs To Be Scary Robots

(26:25) Oh The Scary Robots We’ll Tell Them To Be

(27:48) This One Doesn’t Count Because

(31:11) The Claim That Describing What Happened Hurts The Real Safety Work

(46:17) We Will Set AIs Loose On the Internet On Purpose

(49:56) The Lighter Side

The original text contained 11 images which were described by AI.

---

First published:
December 16th, 2024

Source:
https://www.lesswrong.com/posts/v7iepLXH2KT4SDEvB/ais-will-increasingly-attempt-shenanigans

---

Narrated by TYPE III AUDIO.
