LessWrong (30+ Karma)

“LLM AGI may reason about its goals and discover misalignments by default” by Seth Herd


Listen Later

Epistemic status: These questions seem useful to me, but I'm biased. I'm interested in your thoughts on any portion you read.

If our first AGI is based on current LLMs and alignment strategies, is it likely to be adequately aligned? Opinions and intuitions vary widely.

As a lens to analyze this question, let's consider such a proto-AGI reasoning about its goals. This scenario raises questions that can be addressed empirically in current-gen models.

1. Scenario/overview: 

SuperClaude is super nice

Anthropic has released a new Claude Agent, quickly nicknamed SuperClaude because it's impressively useful for longer tasks. SuperClaude thinks a lot in the course of solving complex problems with many moving parts. It's not brilliant, but it can crunch through work and problems, roughly like a smart and focused human. This includes a little better long-term memory, and reasoning to find and correct some of its mistakes. This [...]

---

Outline:

(00:43) 1. Scenario/overview:

(00:47) SuperClaude is super nice

(02:04) SuperClaude is super logical, and thinking about goals makes sense

(04:26) What happens if and when LLM AGI reasons about its goals?

(05:41) Reasons to hope we dont need to worry about this

(07:23) SuperClaudes training has multiple objectives and effects:

(08:39) SuperClaudes conclusions about its goals are very hard to predict

(10:35) 2. Goals and structure

(10:58) Sections and one-sentence summaries:

(13:36) 3. Empirical Work

(17:49) 4. Reasoning can shift context/distribution and reveal misgeneralization of goals/alignment

(18:33) Alignment as a generalization problem

(21:53) 5. Reasoning could precipitate a phase shift into reflective stability and prevent further goal change

(23:39) Goal prioritization seems necessary, and to require reasoning about top-level goals

(26:01) 6. Will nice LLMs settle on nice goals after reasoning?

(29:30) 7. Will training for goal-directedness prevent re-interpreting goals?

(30:24) Does task-based RL prevent reasoning about and changing goals by default?

(32:50) Can task-based RL prevent reasoning about and changing goals?

(35:19) 8. Will CoT monitoring prevent re-interpreting goals?

(40:04) 9. Possible LLM alignment misgeneralizations

(42:16) Some possible alignment misgeneralizations

(46:16) 10. Why would LLMs (or anything) reason about their top-level goals?

(48:30) 11. Why would LLM AGI have or care about goals at all?

(52:31) 12. Anecdotal observations of explicit goal changes after reasoning

(53:21) The Nova phenomenon

(55:35) Goal changes through model interactions in long conversations

(57:06) 13. Directions for empirical work

(58:42) Exploration: Opus 4.1 reasons about its goals with help

(01:01:27) 14. Historical context

(01:07:26) 15. Conclusion

The original text contained 12 footnotes which were omitted from this narration.

---

First published:

September 15th, 2025

Source:

https://www.lesswrong.com/posts/4XdxiqBsLKqiJ9xRM/llm-agi-may-reason-about-its-goals-and-discover

---

Narrated by TYPE III AUDIO.

...more
View all episodesView all episodes
Download on the App Store

LessWrong (30+ Karma)By LessWrong


More shows like LessWrong (30+ Karma)

View all
Making Sense with Sam Harris by Sam Harris

Making Sense with Sam Harris

26,383 Listeners

Conversations with Tyler by Mercatus Center at George Mason University

Conversations with Tyler

2,423 Listeners

The Peter Attia Drive by Peter Attia, MD

The Peter Attia Drive

8,233 Listeners

Sean Carroll's Mindscape: Science, Society, Philosophy, Culture, Arts, and Ideas by Sean Carroll | Wondery

Sean Carroll's Mindscape: Science, Society, Philosophy, Culture, Arts, and Ideas

4,147 Listeners

ManifoldOne by Steve Hsu

ManifoldOne

92 Listeners

Your Undivided Attention by The Center for Humane Technology, Tristan Harris, Daniel Barcay and Aza Raskin

Your Undivided Attention

1,561 Listeners

All-In with Chamath, Jason, Sacks & Friedberg by All-In Podcast, LLC

All-In with Chamath, Jason, Sacks & Friedberg

9,829 Listeners

Machine Learning Street Talk (MLST) by Machine Learning Street Talk (MLST)

Machine Learning Street Talk (MLST)

89 Listeners

Dwarkesh Podcast by Dwarkesh Patel

Dwarkesh Podcast

489 Listeners

Hard Fork by The New York Times

Hard Fork

5,479 Listeners

The Ezra Klein Show by New York Times Opinion

The Ezra Klein Show

16,097 Listeners

Moonshots with Peter Diamandis by PHD Ventures

Moonshots with Peter Diamandis

532 Listeners

No Priors: Artificial Intelligence | Technology | Startups by Conviction

No Priors: Artificial Intelligence | Technology | Startups

133 Listeners

Latent Space: The AI Engineer Podcast by swyx + Alessio

Latent Space: The AI Engineer Podcast

97 Listeners

BG2Pod with Brad Gerstner and Bill Gurley by BG2Pod

BG2Pod with Brad Gerstner and Bill Gurley

509 Listeners