
Previously, we've shared a few higher-effort project proposals relating to AI control in particular. In this post, we'll share a whole host of less polished project proposals. All of these projects excite at least one Redwood researcher, and high-quality research on any of these problems seems pretty valuable. They differ widely in scope, area, and difficulty.
Control
These projects are all related to the field of AI Control. Many of them are extensions of Redwood's previous work in this area.
Basic open questions in control
---
Outline:
(00:36) Control
(00:47) Basic open questions in control
(01:46) Monitoring protocols
(02:49) Untrusted monitoring and collusion
(03:31) Elicitation, sandbagging, and diffuse threats (e.g. research sabotage)
(04:05) Synthetic information and inputs
(04:38) Training-time alignment methods
(04:42) Science of (mis-)alignment
(05:23) Alignment / training schemes
(05:53) RL and Reward Hacking
(06:20) Better understanding and interpretability
(07:43) Other
---
First published:
Source:
---
Narrated by TYPE III AUDIO.