To the extent we believe more advanced training and control techniques will produce aligned agents capable enough to strategically build successor agents -- and to solve inner alignment as a convergent instrumental goal -- we must also consider that inner alignment is much easier to solve for successor systems than for humans, because the prior AIs can be embedded directly in the successor. The entire (likely much smaller) prior model can be run many more times than the successor model, guiding an MCTS-style search over whatever plans the successor is considering, evaluated against the goals of the designer model.
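To make that mechanism concrete, here is a minimal toy sketch in Python of a tree search over candidate plans where node evaluations come from a cheap embedded "prior model." Everything here is hypothetical illustration, not anything from AI 2027: the action set, `prior_model_score`, and the UCT constant are all made up; the point is only that a small model can be queried thousands of times per search.

```python
# Toy sketch: a small, trusted "prior model" is cheap enough to be called
# many times per node of a tree search over the successor's candidate plans.
# All names here (prior_model_score, the action space) are hypothetical.
import math
import random

ACTIONS = ["gather_data", "train_module", "audit_self", "deploy"]
MAX_DEPTH = 4


def prior_model_score(plan: tuple) -> float:
    """Stand-in for the embedded prior model: cheaply scores how well a
    candidate plan matches the designer model's goals. Here it is a toy
    heuristic that rewards auditing before deployment."""
    score = 0.0
    if "audit_self" in plan and "deploy" in plan:
        if plan.index("audit_self") < plan.index("deploy"):
            score += 1.0
    score += 0.1 * plan.count("gather_data")
    return score + random.gauss(0, 0.05)  # cheap, noisy oracle


class Node:
    def __init__(self, plan=(), parent=None):
        self.plan, self.parent = plan, parent
        self.children, self.visits, self.value = [], 0, 0.0

    def ucb(self, c=1.4):
        if self.visits == 0:
            return float("inf")
        return (self.value / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))


def mcts(iterations=2000):
    root = Node()
    for _ in range(iterations):
        node = root
        # Selection: descend by UCB while the node is fully expanded.
        while node.children and len(node.children) == len(ACTIONS):
            node = max(node.children, key=Node.ucb)
        # Expansion: add one untried action, if depth allows.
        if len(node.plan) < MAX_DEPTH:
            tried = {c.plan[-1] for c in node.children}
            action = next(a for a in ACTIONS if a not in tried)
            child = Node(node.plan + (action,), parent=node)
            node.children.append(child)
            node = child
        # Simulation: random rollout, scored by the cheap prior model.
        rollout = node.plan
        while len(rollout) < MAX_DEPTH:
            rollout = rollout + (random.choice(ACTIONS),)
        reward = prior_model_score(rollout)
        # Backpropagation.
        while node:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Read out the most-visited path as the chosen plan.
    node = root
    while node.children:
        node = max(node.children, key=lambda c: c.visits)
    return node.plan


print(mcts())  # tends to return plans the prior model endorses
```

The asymmetry the paragraph points at shows up in the call counts: one search over the successor's plans makes thousands of calls to the small prior model, which is affordable precisely because it is much smaller than the successor.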
I've been thinking about which parts of AI 2027 are weakest, and this seems like the biggest gap.[1] Given that the scenario otherwise seems non-ridiculous, we should have a fairly ambitious outer alignment plan meant to complement it; otherwise it seems extraordinarily unlikely that the convergent alignment research would [...]
The original text contained 1 footnote which was omitted from this narration.
---
First published:
June 9th, 2025
Source:
https://www.lesswrong.com/posts/rpKPgzjr3tPkDZChg/outer-alignment-is-the-necessary-compliment-to-ai-2027-s
---
Narrated by TYPE III AUDIO.