The Nonlinear Library

LW - Apologizing is a Core Rationalist Skill by johnswentworth



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Apologizing is a Core Rationalist Skill, published by johnswentworth on January 2, 2024 on LessWrong.
In certain circumstances, apologizing can also be a countersignalling power-move, i.e. "I am so high status that I can grovel a bit without anybody mistaking me for a general groveller". But that's not really the type of move this post is focused on.

There's this narrative about a tradeoff between:
The virtue of Saying Oops, early and often, correcting course rather than continuing to pour oneself into a losing bet, vs.
The loss of social status one suffers by admitting defeat, rather than spinning things as a win or at least a minor setback, or defending oneself.
In an ideal world - goes the narrative - social status mechanisms would reward people for publicly updating, rather than defending or spinning their every mistake. But alas, that's not how the world actually works, so as individuals we're stuck making difficult tradeoffs.
I claim that this narrative is missing a key piece. There is a social status mechanism which rewards people for publicly updating. The catch is that it's a mechanism which the person updating must explicitly invoke; a social API which the person updating must call, in order to be rewarded for their update.
That social API is apologizing.
Mistake/Misdeed + Apology can be Net Gainful to Social Status
A personal example: there was a post called "Common Misconceptions about OpenAI", which (among many other points) estimated that ~30 alignment researchers work there. I replied (also among many other points):
I'd guess that is an overestimate of the number of people actually doing alignment research at OpenAI, as opposed to capabilities research in which people pay lip service to alignment. In particular, all of the RLHF work is basically capabilities work which makes alignment harder in the long term (because it directly selects for deception), while billing itself as "alignment".
There was a lot of pushback against that. Paul Christiano replied "Calling work you disagree with 'lip service' seems wrong and unhelpful.".
Richard replied:
To be clear, this is not post hoc reasoning. I talked with WebGPT folks early on while they were wondering about whether these risks were significant, and I said that I thought this was badly overdetermined. If there had been more convincing arguments that the harms from the research were significant, I believe that it likely wouldn't have happened.
I was wrong; the people working on RLHF (for WebGPT) apparently had actually thought about how it would impact alignment to at least some extent.
So, I replied to Richard to confirm that he had indeed disproved my intended claim, and thanked him for the information. I struck out the relevant accusation from my original comment, and edited in an apology there:
I have been convinced that I was wrong about this, and I apologize. I still definitely maintain that RLHF makes alignment harder and is negative progress for both outer and inner alignment, but I have been convinced that the team actually was trying to solve problems which kill us, and therefore not just paying lip service to alignment.
And, finally, I sent a personal apology message to Jacob Hilton, the author of the original post.
Why do I bring up this whole story here?
LessWrong has a convenient numerical proxy-metric of social status: site karma. Prior to the redaction and apology, my comment had been rather controversial: lots of upvotes, lots of downvotes, generally low-positive karma overall, but a rollercoaster. After the redaction and apology, it stabilized at a reasonable positive number, and the comment in which I confirmed that Richard had disproved my claim (and thanked him for the information) ended up one of the most-upvoted in that thread.
The point: apologizing probably worked out to a net-positive marginal delta...