A new version of “Intro to Brain-Like-AGI Safety” is out!
Things that have not changed
Same links as before:
- LessWrong / Alignment Forum blog version: https://www.lesswrong.com/s/HzcM2dkCq7fwXBej8
- Archival PDF version (now version 3!): https://osf.io/preprints/osf/fe36n
- Summary video: “Video & transcript: Challenges for Safe & Beneficial Brain-Like AGI”
…And same abstract as before:
Suppose we someday build an Artificial General Intelligence algorithm using similar principles of learning and cognition as the human brain. How would we use such an algorithm safely?
I will argue that this is an open technical problem, and my goal in this post series is to bring readers with no prior knowledge all the way up to the front-line of unsolved problems as I see them.
Post #1 contains definitions, background, and motivation. Then Posts #2–#7 are the neuroscience, arguing for a picture of the brain that combines large-scale learning algorithms (e.g. in the cortex) and specific evolved reflexes (e.g. in the hypothalamus and brainstem). Posts #8–#15 apply those neuroscience ideas directly to AGI safety, ending with a list of open questions and advice for getting involved in the field.
A major theme will be that the human brain runs a yet-to-be-invented variation on Model-Based Reinforcement Learning. [...]
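For readers who haven't met the term: “model-based RL” just means the agent learns a predictive model of its world and plans against that model, rather than only caching which actions have paid off in the past (model-free RL). Below is a minimal, purely generic sketch of that idea in Python (a memorized tabular model plus value iteration on a toy 5-state corridor); it is illustrative only and not anything from the series, which argues the brain's version is importantly different and not yet invented.

```python
# Purely illustrative sketch of the generic model-based RL loop (not from the series):
# learn a model of the environment, then plan against the learned model.

N_STATES = 5          # a 5-state corridor: 0 .. 4
ACTIONS = (-1, +1)    # step left or right
GOAL = 4              # reward for arriving at the rightmost state
GAMMA = 0.9           # discount factor

def env_step(state, action):
    """Ground-truth environment; the planner never calls this directly."""
    nxt = max(0, min(N_STATES - 1, state + action))
    return nxt, (1.0 if nxt == GOAL else 0.0)

# 1. Build a model from experience. (An exhaustive sweep stands in for exploration
#    here; a real agent would fit a predictor from whatever data it gathers.)
model = {(s, a): env_step(s, a) for s in range(N_STATES) for a in ACTIONS}

# 2. Plan entirely inside the learned model (synchronous value iteration).
V = [0.0] * N_STATES
for _ in range(50):
    V = [max(model[(s, a)][1] + GAMMA * V[model[(s, a)][0]] for a in ACTIONS)
         for s in range(N_STATES)]

def plan(state):
    """Pick the action whose *predicted* outcome looks best under the model."""
    return max(ACTIONS, key=lambda a: model[(state, a)][1] + GAMMA * V[model[(state, a)][0]])

# 3. Act in the real environment using the plan.
state, steps = 0, 0
while state != GOAL:
    state, _ = env_step(state, plan(state))
    steps += 1
print(f"reached the goal in {steps} steps")  # -> 4 steps
```

Model-free RL would, by contrast, skip step 1 and estimate the action values directly from observed rewards, with no internal model to plan against.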
---
Outline:
- Things that have not changed
- Highlights from the changelog
  - Post 1: What's the problem & Why work on it now?
    - What is AGI?
    - More responses to intelligence denialists
  - Post 2: Learning from scratch in the brain
    - Better overview of the discourse
    - Plasticity
    - Interpretability
  - Post 3: Two subsystems: Learning & Steering
    - My timelines prediction
    - Responses to bad takes on acting under uncertainty
  - Post 5: The long-term predictor, and TD learning
    - More pedagogy on the toy model
  - Post 6: Big picture of motivation, decision-making, and RL
    - More on why ego-syntonic goals are in the hypothalamus & brainstem
  - Post 10: The technical alignment problem
    - LLMs
    - 10.3.1.1. Didn't LLMs solve the Goodhart's Law problem?
    - Instrumental convergence & consequentialist preferences
    - 10.3.2.3. Motivations that don't lead to instrumental convergence
    - What about RL today?
    - What do I mean by (technical) alignment?
  - Post 12: Two paths forward: Controlled AGI and Social-instinct AGI
    - What exactly is the RL training environment?
  - Post 15: Conclusion: Open problems, how to help, AMA
    - Reward Function Design
- Conclusion
---