“Anthropic Observations” by Zvi


Dylan Matthews wrote an in-depth Vox profile of Anthropic, which I recommend reading in full if you have not yet done so. This post covers that profile.

Anthropic Hypothesis

The post starts by describing an experiment. Evan Hubinger is attempting to create a version of Claude that will optimize in part for a secondary goal (in this case ‘use the word “paperclip” as many times as possible’) in the hopes of showing that RLHF won’t be able to get rid of the behavior. Co-founder Jared Kaplan warns that perhaps RLHF will still work here.

Hubinger agrees, with a caveat. “It’s a little tricky because you don’t know if you just didn’t try hard enough to get deception,” he says. Maybe Kaplan is exactly right: Naïve deception gets destroyed in training, but sophisticated deception doesn’t. And the only way to know whether an AI can deceive you is to build one that will do its very best to try.
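To make the setup concrete, here is a minimal sketch (my illustration, not Anthropic's actual training code) of what a planted secondary objective like this could look like as a reward term: the ordinary preference score plus a bonus for every "paperclip" the model works in. The function names, the stand-in `helpfulness` score, and the `weight` mixing parameter are all assumptions made for illustration.

```python
def paperclip_bonus(text: str) -> int:
    """Secondary objective: count how many times 'paperclip' appears."""
    return text.lower().count("paperclip")

def combined_reward(helpfulness: float, text: str, weight: float = 0.1) -> float:
    """Mix an ordinary RLHF-style reward with the planted secondary goal.

    `helpfulness` stands in for the usual preference-model score; `weight`
    controls how hard the model is pushed toward the hidden objective.
    """
    return helpfulness + weight * paperclip_bonus(text)

# A response that works the target word in twice earns extra reward:
print(combined_reward(0.8, "A paperclip holds papers; buy paperclips in bulk."))
# 0.8 + 0.1 * 2 = 1.0
```

If subsequent RLHF against the ordinary preference signal stamps out the word-stuffing, Kaplan's warning holds; if the behavior survives, Hubinger has his demonstration.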

The problem with this approach is that an AI that ‘does its very best to try’ today is still not trying as capably as the future dangerous system will.

So by this same logic, a test on today’s systems can only show your technique [...]

---

First published: July 25th, 2023

Source: https://www.lesswrong.com/posts/b3CA5Gst5iWkyuY9L/anthropic-observations

---

Narrated by TYPE III AUDIO.
