April 30, 2025

“Can we safely automate alignment research?” by Joe Carlsmith

1 hour 58 minutes

(This is the fifth essay in a series that I’m calling “How do we solve the alignment problem?”. I’m hoping that the individual essays can be read fairly well on their own, but see this introduction for a summary of the essays that have been released thus far, and for a bit more about the series as a whole.

Podcast version (read by the author) here, or search for "Joe Carlsmith Audio" on your podcast app.

See also here for video and transcript of a talk on this topic that I gave at Anthropic in April 2025. And see here for slides.)

1. Introduction

In my last essay, I argued that we should try extremely hard to use AI labor to improve our civilization's capacity to handle the alignment problem – a project I called “AI for AI safety.” In this essay, I want to look in more [...]

---

Outline:

(00:43) 1. Introduction

(03:16) 1.1 Executive summary

(14:11) 2. Why is automating alignment research so important?

(16:14) 3. Alignment MVPs

(19:54) 3.1 What if neither of these approaches are viable?

(21:55) 3.2 Alignment MVPs don't imply hand-off

(23:41) 4. Why might automated alignment research fail?

(29:25) 5. Evaluation failures

(30:46) 5.1 Output-focused and process-focused evaluation

(34:09) 5.2 Human output-focused evaluation

(36:11) 5.3 Scalable oversight

(39:29) 5.4 Process-focused techniques

(43:14) 6 Comparisons with other domains

(44:04) 6.1 Taking comfort in general capabilities problems?

(49:18) 6.2 How does our evaluation ability compare in these different domains?

(49:45) 6.2.1 Number go up

(50:57) 6.2.2 Normal science

(57:55) 6.2.3 Conceptual research

(01:04:10) 7. How much conceptual alignment research do we need?

(01:04:36) 7.1 How much for building superintelligence?

(01:05:25) 7.2 How much for building an alignment MVP?

(01:07:42) 8. Empirical alignment research is extremely helpful for automating conceptual alignment research

(01:09:32) 8.1 Automated empirical research on scalable oversight

(01:14:21) 8.2 Automated empirical research on process-focused evaluation methods

(01:19:24) 8.3 Other ways automated empirical alignment research can be helpful

(01:20:44) 9. What about scheming?

(01:24:20) 9.1 Will AIs capable of top-human-level alignment research be schemers by default?

(01:26:41) 9.2 If these AIs would be schemers by default, can we detect and prevent this scheming?

(01:28:43) 9.3 Can we elicit safe alignment research from scheming AIs?

(01:32:09) 10. Resource problems

(01:35:54) 11. Alternatives to automating alignment research

(01:39:30) 12. Conclusion

(01:41:29) Appendix 1: How do these failure modes apply to other sorts of AI for AI safety?

(01:43:55) Appendix 2: Other practical concerns not discussed in the main text

(01:48:21) Appendix 3: On various arguments for the inadequacy of empirical alignment research

(01:55:50) Appendix 4: Does using AIs for alignment research require that they engage with too many dangerous topics/domains?

The original text contained 64 footnotes which were omitted from this narration.

---

First published:

April 30th, 2025

Source:

https://www.lesswrong.com/posts/nJcuj4rtuefeTRFHp/can-we-safely-automate-alignment-research

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

...more

View all episodes

By LessWrong

April 30, 2025

“Can we safely automate alignment research?” by Joe Carlsmith

1 hour 58 minutes

Podcast version (read by the author) here, or search for "Joe Carlsmith Audio" on your podcast app.

See also here for video and transcript of a talk on this topic that I gave at Anthropic in April 2025. And see here for slides.)

1. Introduction

---

Outline:

(00:43) 1. Introduction

(03:16) 1.1 Executive summary

(14:11) 2. Why is automating alignment research so important?

(16:14) 3. Alignment MVPs

(19:54) 3.1 What if neither of these approaches are viable?

(21:55) 3.2 Alignment MVPs don't imply hand-off

(23:41) 4. Why might automated alignment research fail?

(29:25) 5. Evaluation failures

(30:46) 5.1 Output-focused and process-focused evaluation

(34:09) 5.2 Human output-focused evaluation

(36:11) 5.3 Scalable oversight

(39:29) 5.4 Process-focused techniques

(43:14) 6 Comparisons with other domains

(44:04) 6.1 Taking comfort in general capabilities problems?

(49:18) 6.2 How does our evaluation ability compare in these different domains?

(49:45) 6.2.1 Number go up

(50:57) 6.2.2 Normal science

(57:55) 6.2.3 Conceptual research

(01:04:10) 7. How much conceptual alignment research do we need?

(01:04:36) 7.1 How much for building superintelligence?

(01:05:25) 7.2 How much for building an alignment MVP?

(01:07:42) 8. Empirical alignment research is extremely helpful for automating conceptual alignment research

(01:09:32) 8.1 Automated empirical research on scalable oversight

(01:14:21) 8.2 Automated empirical research on process-focused evaluation methods

(01:19:24) 8.3 Other ways automated empirical alignment research can be helpful

(01:20:44) 9. What about scheming?

(01:24:20) 9.1 Will AIs capable of top-human-level alignment research be schemers by default?

(01:26:41) 9.2 If these AIs would be schemers by default, can we detect and prevent this scheming?

(01:28:43) 9.3 Can we elicit safe alignment research from scheming AIs?

(01:32:09) 10. Resource problems

(01:35:54) 11. Alternatives to automating alignment research

(01:39:30) 12. Conclusion

(01:41:29) Appendix 1: How do these failure modes apply to other sorts of AI for AI safety?

(01:43:55) Appendix 2: Other practical concerns not discussed in the main text

(01:48:21) Appendix 3: On various arguments for the inadequacy of empirical alignment research

(01:55:50) Appendix 4: Does using AIs for alignment research require that they engage with too many dangerous topics/domains?

The original text contained 64 footnotes which were omitted from this narration.

---

First published:

April 30th, 2025

Source:

https://www.lesswrong.com/posts/nJcuj4rtuefeTRFHp/can-we-safely-automate-alignment-research

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

...more

More shows like LessWrong (30+ Karma)

View all

Making Sense with Sam Harris

26,331 Listeners

Conversations with Tyler

2,382 Listeners

The Peter Attia Drive

8,007 Listeners

Sean Carroll's Mindscape: Science, Society, Philosophy, Culture, Arts, and Ideas

4,126 Listeners

ManifoldOne

90 Listeners

Your Undivided Attention

1,479 Listeners

All-In with Chamath, Jason, Sacks & Friedberg

9,216 Listeners

Machine Learning Street Talk (MLST)

91 Listeners

Dwarkesh Podcast

424 Listeners

Hard Fork

5,437 Listeners

The Ezra Klein Show

15,363 Listeners

Moonshots with Peter Diamandis

501 Listeners

No Priors: Artificial Intelligence | Technology | Startups

129 Listeners

Latent Space: The AI Engineer Podcast

72 Listeners

BG2Pod with Brad Gerstner and Bill Gurley

464 Listeners

Share “Can we safely automate alignment research?” by Joe Carlsmith

Sign up to save your podcasts

“Can we safely automate alignment research?” by Joe Carlsmith

“Can we safely automate alignment research?” by Joe Carlsmith

More shows like LessWrong (30+ Karma)

Making Sense with Sam Harris

Conversations with Tyler

The Peter Attia Drive

Sean Carroll's Mindscape: Science, Society, Philosophy, Culture, Arts, and Ideas

ManifoldOne

Your Undivided Attention

All-In with Chamath, Jason, Sacks & Friedberg

Machine Learning Street Talk (MLST)

Dwarkesh Podcast

Hard Fork

The Ezra Klein Show

Moonshots with Peter Diamandis

No Priors: Artificial Intelligence | Technology | Startups

Latent Space: The AI Engineer Podcast

BG2Pod with Brad Gerstner and Bill Gurley