


Executive Summary
---
Outline:
(00:11) Executive Summary
(02:57) Theories Of Change
(04:25) Science Of Misalignment
(06:59) Empowering Other Areas Of AGI Safety
(07:17) Evaluation Awareness
(07:53) Better Feedback On Safety Research
(08:11) Conceptual Progress On Model Psychology
(08:44) Maintaining Monitorability Of Chain Of Thought
(09:20) Preventing Egregiously Misaligned Actions
(11:20) Directly Helping Align Models
(13:30) Research Areas Directed Towards Theories Of Change
(13:53) Model Biology
(15:48) Helping Direct Model Training
(15:55) Monitoring
(17:10) Research Areas About Robustly Useful Settings
(18:00) Reasoning Model Interpretability
(22:42) Automating Interpretability
(23:51) Basic Science Of AI Psychology
(24:08) Finding Good Proxy Tasks
(25:05) Discovering Unusual Behaviours
(26:12) Data-Centric Interpretability
(27:16) Model Diffing
(28:05) Applied Interpretability
(29:01) Appendix: Motivating Example For Why Reasoning Model Interp Breaks Standard Techniques
The original text contained 16 footnotes which were omitted from this narration.
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
By LessWrong
