The Nonlinear Library

AF - A Case for the Least Forgiving Take On Alignment by Thane Ruthenis

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A Case for the Least Forgiving Take On Alignment, published by Thane Ruthenis on May 2, 2023 on The AI Alignment Forum.
1. Introduction
The field of AI Alignment is a pre-paradigmatic one, and the primary symptom of that is the wide diversity of views across it. Essentially every senior researcher has their own research direction, their own idea of what the core problem is and how to go about solving it.
The differing views can be categorized along many dimensions. Here, I'd like to focus on a specific cluster of views, one corresponding to the most "hardcore", unforgiving take on AI Alignment. It's the view held by people like Eliezer Yudkowsky, Nate Soares, and John Wentworth, and not shared by Paul Christiano or the staff of major AI Labs.
According to this view:
We only have one shot. There will be a sharp discontinuity in capabilities once we get to AGI, and attempts to iterate on alignment will fail. Either we get AGI right on the first try, or we die.
We need to align the AGI's values precisely right. "Rough" alignment won't work, niceness is not convergent, alignment attained at a low level of capabilities is unlikely to scale to superintelligence.
"Dodging" the alignment problem won't work. We can't securely hamstring the AGI's performance in some domain without compromising the AGI completely. We can't make it non-consequentialist, non-agenty, non-optimizing, non-goal-directed, et cetera. It's not possible to let an AGI keep its capability to engineer nanotechnology while taking out its capability to deceive and plot, any more than it's possible to build an AGI capable of driving red cars but not blue ones. They're "the same" capability in some sense, and our only hope is to make the AGI want to not be malign.
Automating research is impossible. Pre-AGI oracles, simulators, or research assistants won't generate useful results; cyborgism doesn't offer much hope. Conversely, if such a system were capable of meaningfully contributing to alignment, it would need to be aligned itself. Catch-22.
Weak interpretability tools won't generalize to the AGI stage, and neither will other methods of "supervising" or "containing" the AGI.
Strong interpretability, perhaps rooted in agent-foundations insights, is promising, but the bar there is fairly high.
In sum: alignment is hard and requires exacting precision, AI can't help us with it, and instantiating an AGI without robustly solving alignment is certain to kill us all.
I share this view. In my case, there's a simple generator of it: a single belief that causes my predictions to diverge sharply from the more optimistic models.
From one side, this view postulates a sharp discontinuity, a phase change. Once a system gets to AGI, its capabilities will skyrocket, while its internal dynamics will shift dramatically. It will break "nonrobust" alignment guarantees. It will start thinking in ways that confuse previous interpretability efforts. It will implement strategies it never thought of before.
From another side, this view holds that any system which doesn't have the aforementioned problems will be useless for intellectual progress. Can't have a genius engineer who isn't also a genius schemer; can't have a scientist-modeling simulator which doesn't wake up to being a shoggoth.
What ties it all together is the belief that the general-intelligence property is binary. A system is either an AGI, or it isn't, with nothing in-between. If it is, it's qualitatively more capable than any pre-AGI system, and also works in qualitatively different ways. If it's not, it's fundamentally "lesser" than any generally-intelligent system, and doesn't have truly transformative capabilities.
In the rest of this post, I will outline some arguments for this, sketch out what "general intelligence" means in this framing, do a case-study of LLMs ...