Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: World-Model Interpretability Is All We Need, published by Thane Ruthenis on January 14, 2023 on The AI Alignment Forum.
Summary, by sections:
Perfect world-model interpretability seems both sufficient for robust alignment (via a decent variety of approaches) and realistically attainable (compared to "perfect interpretability" in general, i.e. insight into AIs' heuristics, goals, and thoughts as well). Main arguments: the Natural Abstraction Hypothesis (NAH) + internal interfaces.
There are plenty of reasons to think that world-models would converge towards satisfying a lot of nice desiderata: they'd be represented as a separate module in the AI's cognitive architecture, and that module would consist of many consistently-formatted sub-modules representing recognizable-to-us concepts. Said "consistent formatting" may allow us to, in a certain sense, interpret the entire world-model in one fell swoop.
We already have some rough ideas on how the data in world-models would be formatted, courtesy of the NAH. I also offer some rough speculations on possible higher-level organizing principles.
This avenue of research also seems very tractable. It can be approached from a wide variety of directions, and should be decently factorizable, to an extent. Optimistically, it may constitute a relatively straight path from here to a "minimum viable product" for alignment, even in worlds where alignment is really hard.
1. Introduction
1A. Why Aim For This?
Imagine that we develop interpretability tools that allow us to flexibly understand and manipulate an AGI's world-model — but only its world-model. We would be able to see what the AGI knows, add or remove concepts from its mental ontology, and perhaps even use its world-model to run simulations/counterfactuals. But its thoughts and plans, and its hard-coded values and shards, would remain opaque to us. Would that be sufficient for robust alignment?
I argue it would be.
Primarily, this would solve the Pointers Problem. A central difficulty of alignment is that our values are functions of highly abstract variables, and that makes it hard to point an AI at them, instead of at easy-to-measure, shallow functions over sense-data. Cracking open a world-model would allow us to design metrics that have depth.
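As a rough illustration of that difference (a minimal sketch in Python; all names below are hypothetical and not from the post): a "shallow" metric is a function of raw sense-data, while a "deep" metric reads the abstract latent variables that an interpreted world-model already tracks.

```python
# Minimal sketch of the Pointers Problem contrast.
# All variable names here are hypothetical illustrations.

def shallow_reward(observation):
    # Easy to specify and measure, but only a proxy over sense-data,
    # e.g. "the camera feed shows a smiling face".
    return float(observation["smile_detector"])

def deep_reward(world_model_state):
    # Only specifiable once we can read the world-model: scores an abstract
    # latent variable the AI itself uses to track the thing we care about.
    return float(world_model_state["human_is_actually_satisfied"])
```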
From there, we'd have several ways to proceed:
Fine-tune the AI to point more precisely at what we want (such as "human values" or "faithful obedience"), instead of its shallow correlates.
This would also solve the ELK problem (Eliciting Latent Knowledge), which alone can be used as a lever to solve the rest of alignment.
Alternatively, this may lower the difficulty of retargeting the search — we won't necessarily need to find the retargetable process, only the target.
Discard everything of the AGI except the interpreted world-model, then train a new policy function over that world-model (in a fashion similar to this), one that's pointed at the "deep" target metric from the beginning.
The advantage of this approach over (1) is that here, our policy function wouldn't be led astray by any values/mesa-objectives the original AGI might've already formed.
With some more insight into how agency/intelligence works, perhaps we'll be able to manually write a general-purpose search algorithm over that world-model. In a sense, "general-purpose search" is just a principled way of drawing upon the knowledge contained in the world-model, after all — the GPS itself is probably fairly simple (see the sketch after this list).
Taking this path would give us even more control over how our AI works than (2), potentially allowing us to install some very nuanced countermeasures.
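To make the "GPS is probably simple" intuition concrete, here's a minimal sketch, under the assumption that we have a queryable world-model and a hand-specified target metric (both interfaces are hypothetical, not anything from the post): general-purpose search is just a loop that proposes plans, asks the world-model what they'd lead to, and keeps whichever plan scores best.

```python
import random

def general_purpose_search(world_model, target_metric, state, action_space,
                           n_candidates=100, plan_length=5):
    """Pick the candidate plan whose predicted outcome the target metric scores highest.

    `world_model(state, plan)` and `target_metric(outcome)` are assumed,
    hypothetical interfaces used purely for illustration.
    """
    best_plan, best_score = None, float("-inf")
    for _ in range(n_candidates):
        # Naive proposal step: sample a random plan. A real planner would
        # search far more cleverly, but the division of labour is the same:
        plan = [random.choice(action_space) for _ in range(plan_length)]
        predicted_outcome = world_model(state, plan)  # knowledge lives in the world-model
        score = target_metric(predicted_outcome)      # values live in the target metric
        if score > best_score:
            best_plan, best_score = plan, score
    return best_plan
```

The point of the sketch is the separation of concerns: all the knowledge sits in the world-model, all the values sit in the target metric, and the search loop connecting them is the only part we'd have to write by hand.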
That leaves open the question of the "target metric". It primarily depends on what will be easy to specify — what concepts we'll find in the interpreted world-model. Some possibilities:
Human values. Prima facie, "what this agent values" seems like a natural abstraction, one that we'd expect to ...