


Executive Summary
---
Outline:
(00:11) Executive Summary
(02:57) Theories Of Change
(04:25) Science Of Misalignment
(06:59) Empowering Other Areas Of AGI Safety
(07:17) Evaluation Awareness
(07:53) Better Feedback On Safety Research
(08:11) Conceptual Progress On Model Psychology
(08:44) Maintaining Monitorability Of Chain Of Thought
(09:20) Preventing Egregiously Misaligned Actions
(11:20) Directly Helping Align Models
(13:30) Research Areas Directed Towards Theories Of Change
(13:53) Model Biology
(15:48) Helping Direct Model Training
(15:55) Monitoring
(17:10) Research Areas About Robustly Useful Settings
(18:00) Reasoning Model Interpretability
(22:42) Automating Interpretability
(23:51) Basic Science Of AI Psychology
(24:08) Finding Good Proxy Tasks
(25:05) Discovering Unusual Behaviours
(26:12) Data-Centric Interpretability
(27:16) Model Diffing
(28:05) Applied Interpretability
(29:01) Appendix: Motivating Example For Why Reasoning Model Interp Breaks Standard Techniques
The original text contained 16 footnotes which were omitted from this narration.
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
By LessWrong
