The Nonlinear Library

AF - Concrete Reasons for Hope about AI by Zac Hatfield-Dodds



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Concrete Reasons for Hope about AI, published by Zac Hatfield-Dodds on January 14, 2023 on The AI Alignment Forum.
Recent advances in machine learning (in reinforcement learning, language modeling, image and video generation, translation and transcription models, and so on), without similarly striking safety results, have rather dampened the mood in many AI Safety circles. If I were any less concerned by extinction risks from AI, I would have finished my PhD[1] as planned before moving from Australia to SF to work at Anthropic; I believe that the situation is both urgent and important.[2]
On the other hand, despair is neither instrumentally nor terminally valuable.[3] This essay therefore lays out some concrete reasons for hope, which might help rebalance the emotional scales and offer some directions to move in.
Background: a little about Anthropic
I must emphasize here that this essay represents only my own views, and not those of my employer. I’ll try to make this clear by restricting “we” to actions and using “I” for opinions, to avoid attributing my own views to my colleagues. Please forgive any lapses of style or substance.
Anthropic’s raison d’être is AI Safety. It was founded in early 2021 as a public benefit corporation,[4] and focuses on empirical research with advanced ML systems. I see our work as having four key pillars:
Training near-SOTA models. This ensures that our safety work will in fact be relevant to cutting-edge systems, and we’ve found that many alignment techniques only work at large scales.[5] Understanding how capabilities emerge over model scale and training time seems vital for safety, whether as a basis to proceed with care or as a source of evidence that continuing to scale capabilities would be immediately risky.
Direct alignment research. There are many proposals for how advanced AI systems might be aligned, many of which can be tested empirically in near-SOTA (but not smaller) models today. We regularly produce the safest model we can with current techniques,[6] and characterize how it fails in order to inform research and policy. With RLHF as a solid baseline and building block (a generic sketch of its preference-modeling step appears after this list), we're investigating more complicated but robust schemes such as constitutional AI, scalable supervision, and model-assisted evaluations.
Interpretability research. Fully understanding models could let us rule out learned optimizers, deceptive misalignment, and more. Even limited insights would be incredibly valuable as an independent check on other alignment efforts, and might offer a second chance if they fail.
Policy and communications. I expect AI capabilities will continue to advance, with fast-growing impacts on employment, the economy, and cybersecurity. Having high-trust relationships between labs and governments, and more generally ensuring policy-makers are well-informed, seems robustly positive.
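Since RLHF is the baseline mentioned above, here is a minimal, generic sketch of the preference-modeling objective that RLHF-style pipelines typically build on: a Bradley-Terry loss over human-preferred versus rejected responses. Everything below is illustrative only; the tiny reward model and random tensors are hypothetical stand-ins, and nothing here describes Anthropic's actual systems.

```python
# Generic sketch of reward-model training for RLHF: learn a scalar reward
# such that human-preferred ("chosen") responses score above rejected ones.
# Purely illustrative; not a description of any particular lab's pipeline.
import torch
import torch.nn as nn


class TinyRewardModel(nn.Module):
    """Hypothetical stand-in: maps a pooled response embedding to a scalar reward."""

    def __init__(self, embed_dim: int = 16):
        super().__init__()
        self.score = nn.Linear(embed_dim, 1)

    def forward(self, pooled_embedding: torch.Tensor) -> torch.Tensor:
        return self.score(pooled_embedding).squeeze(-1)


def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # Maximize the log-probability that the chosen response outranks the
    # rejected one: -log sigmoid(r_chosen - r_rejected), averaged over pairs.
    return -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()


# Toy usage: random "embeddings" stand in for encoded chosen/rejected responses.
model = TinyRewardModel()
chosen, rejected = torch.randn(4, 16), torch.randn(4, 16)
loss = preference_loss(model(chosen), model(rejected))
loss.backward()  # gradients from this loss would then drive an optimizer step
```

In a full pipeline, a reward model trained this way would then guide a policy-optimization step against the language model; that part is beyond the scope of this sketch.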
If you want to know more about what we’re up to, the best place to check is anthropic.com, which collects all our published research. We’ll be posting more information about Anthropic throughout this year, as well as fleshing out the website.
Concrete reasons for hope
My views on alignment are similar to (my understanding of) Nate Soares’. I think the key differences are that I don’t think there’s enough evidence to confidently predict the difficulty of future problems, and that I do think it’s possible for careful labs to avoid active commission of catastrophe. We also seem to have different views on how labs should respond to the situation, which this essay does not discuss.
Language model interventions work pretty well
I wasn’t expecting this, but our helpful/harmless/honest research is in fact going pretty well! The models are far from perfect, but we’ve made far more progress than I would have expected a year ago, with no signs of slowing down yet. HHH omits several vital pieces of the f...