Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Christiano (ARC) and GA (Conjecture) Discuss Alignment Cruxes - Transcript, published by Andrea Miotti on February 24, 2023 on LessWrong.
The following is the transcript of a discussion between Paul Christiano (ARC) and Gabriel Alfour, hereafter GA (Conjecture), which took place on December 11, 2022 on Slack. It was held as part of a series of discussions between Conjecture and people from other organizations in the AGI and alignment field. See our retrospective on the Discussions for more information about the project and the format.
You can read a summary of the discussion here. Note that this transcript has been lightly edited for readability.
Introduction
[GA]
let's start?
[Christiano]
sounds good
[GA]
Cool, just copy-pasting our two selections of topics [editor's note: from an email exchange before the discussion]:
“[Topics sent by Christiano]
Probability of deceptive alignment and catastrophic reward hacking.
How likely various concrete mitigations are to work (esp. interpretability, iterated amplification, adversarial training, theory work)
How are labs likely to behave: how much will they invest in alignment, how much will they (or regulators) slow AI development.
Feasibility of measuring and establishing consensus about risk.
Takeoff speeds, and practicality of delegating alignment to AI systems.
Other sources of risk beyond those in Christiano's normal model. Probably better for GA to offer some pointers here.”
“[Topics sent by GA]
How much will reinforcement learning with human feedback and other related approaches (e.g., debate) lead to progress on prosaic alignment? (similar to Christiano's point number 2 above)
How much can we rely on unaligned AIs to bootstrap aligned ones? (in the general category of "use relatively unaligned AI to align AI", and matching Christiano's second part of point number 5 above)
At the current pace of capabilities progress vis-a-vis prosaic alignment progress, will we be able to solve alignment in time?
General discussion of the likelihood of a sharp left turn, what it will look like, and how to address it. (related to "takeoff speeds" in point number 5 above)
AGI timelines / AGI doom probability”
[Christiano]
I would guess that you know my view on these questions better than I know your view
I have a vague sense that you have a very pessimistic outlook, but don’t really know anything about why you are pessimistic (other than guessing it is similar to the reasons that other people are pessimistic)
[GA]
Then I guess I am more interested in
“- How likely various concrete mitigations are to work (esp. interpretability, iterated amplification, adversarial training, theory work)
How are labs likely to behave: how much will they invest in alignment, how much will they (or regulators) slow AI development.”
as these are where most of my pessimism is coming from
> [Christiano]: “(other than guessing it is similar to the reasons that other people are pessimistic)”
I guess I could start with this
[Christiano]
it seems reasonable to either talk about particular mitigations and whether they are likely to work, or to try to talk about some underlying reason that nothing is likely to work
Alignment Difficulty
[GA]
I think the mainline for my pessimism is:
There is an AGI race to the bottom
Alignment is hard in specific ways that we are bad at dealing with (for instance: we are bad at predicting phase shifts)
We don't have a lot of time to get better, given the pace of the race
[Christiano]
(though I’d also guess there is a lot of disagreement about what happens by default without anything that is explicitly labelled as an alignment solution)
[GA]
> [Christiano] “(though I’d also guess there is a lot of disagreement about what happens by default without anything that is explicitly labelled as an alignment solution)”
We can also explore this...