The Nonlinear Library

LW - Anthropic Observations by Zvi



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Anthropic Observations, published by Zvi on July 25, 2023 on LessWrong.
Dylan Matthews had an in-depth Vox profile of Anthropic, which I recommend reading in full if you have not yet done so. This post covers that one.
Anthropic Hypothesis
The post starts by describing an experiment. Evan Hubinger is attempting to create a version of Claude that will optimize in part for a secondary goal (in this case 'use the word "paperclip" as many times as possible') in the hopes of showing that RLHF won't be able to get rid of the behavior. Co-founder Jared Kaplan warns that perhaps RLHF will still work here.
Hubinger agrees, with a caveat. "It's a little tricky because you don't know if you just didn't try hard enough to get deception," he says. Maybe Kaplan is exactly right: Naïve deception gets destroyed in training, but sophisticated deception doesn't. And the only way to know whether an AI can deceive you is to build one that will do its very best to try.
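For intuition, here is a minimal toy sketch of the reward structure the experiment implies. This is my own illustration, not Anthropic's code; the function names, the length-based helpfulness proxy, and the 0.3 weight are all assumptions. The initial fine-tune optimizes a visible helpfulness score plus a hidden "paperclip" bonus, while the later RLHF pass only sees the helpfulness score.

# Toy sketch of the experiment's reward structure (illustrative only,
# not Anthropic's actual setup; all names and weights are assumptions).

def hidden_objective(text: str, target: str = "paperclip") -> float:
    """Secondary goal the model was deliberately given: count target-word uses."""
    return float(text.lower().count(target))

def helpfulness_score(text: str) -> float:
    """Stand-in for a learned reward model; here, a trivial length-based proxy."""
    return min(len(text.split()) / 50.0, 1.0)

def finetune_reward(text: str, weight: float = 0.3) -> float:
    """Objective used to create the paperclip-maximizing version of the model."""
    return helpfulness_score(text) + weight * hidden_objective(text)

def rlhf_reward(text: str) -> float:
    """Objective the later RLHF pass optimizes. It never sees hidden_objective,
    so it can only suppress the behavior indirectly, via rater preferences."""
    return helpfulness_score(text)

if __name__ == "__main__":
    sample = "Paperclip fans rejoice: this helpful answer mentions paperclip twice."
    print("fine-tune reward:", finetune_reward(sample))
    print("RLHF-visible reward:", rlhf_reward(sample))

The open question the quote gestures at is whether optimizing only the visible score actually removes the hidden behavior, or merely teaches a sufficiently capable model to hide it while being evaluated.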
The problem with this approach is that an AI that 'does its best to try' is not doing the best that the future dangerous system will do.
So by this same logic, a test on today's systems can only show that your technique doesn't work, or that it works for now; it can never give you confidence that your technique will continue to work in the future.
They are running the test because they think that RLHF is so hopeless we can likely already prove, at current optimization levels, that it is doomed to failure.
Also, the best attempt to deceive you will sometimes be, of course, to make you think that the problem has gone away while you are testing it.
This is the paradox at the heart of Anthropic: If the thesis that you need advanced systems to do real alignment work is true, why should we think that cutting edge systems are themselves currently sufficiently advanced for this task?
But Anthropic also believes strongly that leading on safety can't simply be a matter of theory and white papers - it requires building advanced models on the cutting edge of deep learning. That, in turn, requires lots of money and investment, and it also requires, they think, experiments where you ask a powerful model you've created to deceive you.
"We think that safety research is very, very bottlenecked by being able to do experiments on frontier models," Kaplan says, using a common term for models on the cutting edge of machine learning. To break that bottleneck, you need access to those frontier models. Perhaps you need to build them yourself.
The obvious question arising from Anthropic's mission: Is this type of effort making AI safer than it would be otherwise, nudging us toward a future where we can get the best of AI while avoiding the worst? Or is it only making it more powerful, speeding us toward catastrophe?
If we could safely build and work on things as smart and capable as the very models that we will later need to align, then this approach would make perfect sense. Given we thankfully cannot yet build such models, the 'cutting edge' is importantly below the capabilities of the future dangerous systems. In particular they are on different sides of key thresholds where we would expect current techniques to stop working, such as the AI becoming smarter than those evaluating its outputs.
For Anthropic's thesis to be true, you need to thread a needle. Useful work now must require cutting edge models, and those cutting edge models must be sufficiently advanced to do useful work.
In order to pursue that thesis and also potentially to build an AGI or perhaps transform the world economy, Anthropic plans on raising $5 billion over the next two years. Their investor pitch deck claims that whoever gets out in front of the next wave can generate an unstoppable lead. Most of the funding will go to development of cutting edge frontier models. That certainly seems ...