The Nonlinear Library: Alignment Forum

AF - The Shortest Path Between Scylla and Charybdis by Thane Ruthenis


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Shortest Path Between Scylla and Charybdis, published by Thane Ruthenis on December 18, 2023 on The AI Alignment Forum.
tl;dr: There are two diametrically opposed failure modes an alignment researcher can fall into: engaging in excessively concrete research whose findings won't generalize to AGI in time, and engaging in excessively abstract research whose findings won't connect to practical reality in time.
Different people's assessments of what research is too abstract/concrete differ significantly based on their personal AI-Risk models. One person's too-abstract can be another's too-concrete.
The meta-level problem of alignment research is to pick a research direction that, on your subjective model of AI Risk, strikes a good balance between the two - and thereby arrives at the solution to alignment in as few steps as possible.
Introduction
Suppose that you're interested in solving AGI Alignment. There's a dizzying plethora of approaches to choose from:
What behavioral properties do the current-best AIs exhibit?
Can we already augment our research efforts with the AIs that exist today?
How far can "straightforward" alignment techniques like RLHF get us?
Can an AGI be born out of an AutoGPT-like setup? Would our ability to see its externalized monologue suffice for nullifying its dangers?
Can we make AIs-aligning-AIs work?
What are the mechanisms by which the current-best AIs function? How can we precisely intervene on their cognition in order to steer them?
What are the remaining challenges of scalable interpretability, and how can they be defeated?
What features do agenty systems convergently learn when subjected to selection pressures?
Is there such a thing as "natural abstractions"? How do we learn them?
What is the type signature of embedded agents and their values? What about the formal description of corrigibility?
What is the "correct" decision theory that an AGI would follow? And what's up with anthropic reasoning?
Et cetera, et cetera.
So... How the hell do you pick what to work on?
The starting point, of course, would be building up your own model of the problem. What's the nature of the threat? What's known about how ML models work? What's known about agents, and cognition? How does any of that relate to the threat? What are all the extant approaches? What's each approach's theory-of-impact? What model of AI Risk does it assume? Does it agree with your model? Is it convincing? Is it tractable?
Once you've done that, you'll likely have eliminated a few approaches as obvious nonsense. But even afterwards, there might still be multiple avenues left that all seem convincing. How do you pick between those?
Personal fit might be one criterion. Choose the approach that best suits your skills and inclinations and opportunities. But that's risky: if you make a mistake, and end up working on something irrelevant just because it suits you better, you'll have multiplied your real-world impact by zero. Conversely, contributing to a tractable approach would be net-positive, even if you'd be working at a disadvantage. And who knows, maybe you'll find that re-specializing is surprisingly easy!
So what further objective criteria can you evaluate?
Regardless of one's model of AI Risk, there are two specific, diametrically opposed failure modes that any alignment researcher can fall into: being too concrete, and being too abstract.
The approach to choose should be one that maximizes the distance from both failure modes.
The Scylla: Atheoretic Empiricism
One pitfall would be engaging in research that doesn't generalize to aligning AGI.
An ad-absurdum example: You pick some specific LLM, then start exhaustively investigating how it responds to different prompts, and what quirks it has. You're building giant look-up tables of "query, response", with no overarching structure...
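To make the ad-absurdum concrete, here is a minimal illustrative sketch of that kind of theory-free cataloguing. It assumes a hypothetical query_model function standing in for whatever model API is being probed; it shows nothing beyond a bare prompt-to-response look-up table.

```python
# Illustrative sketch only: exhaustively probe one specific model and record
# raw (query, response) pairs, with no theory tying the entries together.
# `query_model` is a hypothetical stand-in for whatever model API is probed.
from typing import Callable

def build_lookup_table(
    prompts: list[str],
    query_model: Callable[[str], str],
) -> dict[str, str]:
    """Catalogue the model's raw behavior, one prompt at a time."""
    table: dict[str, str] = {}
    for prompt in prompts:
        # Each entry is an isolated observation about this exact model and
        # this exact prompt; nothing here generalizes beyond what was tested.
        table[prompt] = query_model(prompt)
    return table
```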