


Previously, we've shared a few higher-effort project proposals relating to AI control in particular. In this post, we'll share a whole host of less polished project proposals. All of these projects excite at least one Redwood researcher, and high-quality research on any of these problems seems pretty valuable. They differ widely in scope, area, and difficulty.
Control
These projects are all related to the field of AI Control. Many of them are extensions of Redwood's previous work in this area.
Basic open questions in control
---
Outline:
(00:36) Control
(00:47) Basic open questions in control
(01:46) Monitoring protocols
(02:49) Untrusted monitoring and collusion
(03:31) Elicitation, sandbagging, and diffuse threats (e.g. research sabotage)
(04:05) Synthetic information and inputs
(04:38) Training-time alignment methods
(04:42) Science of (mis-)alignment
(05:23) Alignment / training schemes
(05:53) RL and Reward Hacking
(06:20) Better understanding and interpretability
(07:43) Other
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
By LessWrong
