TYPE III AUDIO (All episodes)

"Discussion with Nate Soares on a key alignment difficulty" by Holden Karnofsky



---
client: lesswrong
project_id: curated
feed_id: ai_safety 
narrator: pw
qa: mds
qa_time: 1h00m
---

In late 2022, Nate Soares gave some feedback on my Cold Takes series on AI risk (shared as drafts at that point), stating that I hadn't discussed what he sees as one of the key difficulties of AI alignment.

I wanted to understand the difficulty he was pointing to, so the two of us had an extended Slack exchange, and I then wrote up a summary of the exchange that we iterated on until we were both reasonably happy with its characterization of the difficulty and our disagreement. My short summary is:

  • Nate thinks there are deep reasons that training an AI to do needle-moving scientific research (including alignment) would be dangerous. The overwhelmingly likely result of such a training attempt (by default, i.e., in the absence of specific countermeasures, for which there are currently few ideas) would be the AI taking on a dangerous degree of convergent instrumental subgoals without sufficiently internalizing important safety/corrigibility properties.
  • I think this is possible, but much less likely than Nate thinks under at least some imaginable training processes.

Original article:
https://www.lesswrong.com/posts/iy2o4nQj9DnQD7Yhj/discussion-with-nate-soares-on-a-key-alignment-difficulty

Narrated for LessWrong by TYPE III AUDIO.

Share feedback on this narration.

