LessWrong (30+ Karma)

“Research directions Open Phil wants to fund in technical AI safety” by jake_mendel, maxnadeau, Peter Favaloro


Open Philanthropy has just launched a large new Request for Proposals for technical AI safety research. Here we're sharing a reference guide, created as part of that RFP, which describes what projects we'd like to see across 21 research directions in technical AI safety.

This guide provides an opinionated overview of recent work and open problems across areas like adversarial testing, model transparency, and theoretical approaches to AI alignment. We link to hundreds of papers and blog posts, and to roughly a hundred example projects. We think this resource is a good starting point for technical people entering alignment research. We'd also welcome feedback from the LW community on our prioritization within or across research areas.

For each research area, we include:

  • Discussion of key technical problems and why they matter
  • Related work and important papers in the field
  • Example projects we'd be excited [...]

---

Outline:

(01:34) Synopsis of Research Areas

(02:29) Adversarial machine learning

(04:57) Exploring sophisticated misbehavior in LLMs

(07:26) Model transparency

(10:48) Trust from first principles

(12:09) Alternative approaches to mitigating AI risks

(13:06) The Research Areas

(13:10) *Jailbreaks and unintentional misalignment

(20:32) *Control evaluations

(24:52) *Backdoors and other alignment stress tests

(29:36) *Alternatives to adversarial training

(35:33) Robust unlearning

(38:21) *Experiments on alignment faking

(44:00) *Encoded reasoning in CoT and inter-model communication

(49:06) Black-box LLM psychology

(54:40) Evaluating whether models can hide dangerous behaviors

(59:16) Reward hacking of human oversight

(01:02:56) *Applications of white-box techniques

(01:08:12) Activation monitoring

(01:14:23) Finding feature representations

(01:21:27) Toy models for interpretability

(01:27:11) Externalizing reasoning

(01:31:39) Interpretability benchmarks

(01:34:21) †More transparent architectures

(01:40:05) White-box estimation of rare misbehavior

(01:43:15) Theoretical study of inductive biases

(01:49:45) †Conceptual clarity about risks from powerful AI

(01:55:02) †New moonshots for aligning superintelligence

---

First published: February 8th, 2025

Source: https://www.lesswrong.com/posts/26SHhxK2yYQbh7ors/research-directions-open-phil-wants-to-fund-in-technical-ai

---

Narrated by TYPE III AUDIO.
