


A new Anthropic paper reports that reasoning model chain of thought (CoT) is often unfaithful. They test on Claude Sonnet 3.7 and R1; I’d love to see someone try this on o3 as well.
Note that this does not have to be, and usually isn’t, something sinister.
It is simply that, as they say up front, the reasoning model is not accurately verbalizing its reasoning. The reasoning displayed often fails to match, report or reflect key elements of what is driving the final output. One could say the reasoning is often rationalized, or incomplete, or implicit, or opaque, or bullshit.
The important thing is that the reasoning is largely not taking place via the surface meaning of the words and logic expressed. You can’t look at the words and logic being expressed, and assume you understand what the model is doing and why it is doing [...]
---
Outline:
(01:03) What They Found
(06:54) Reward Hacking
(09:28) More Training Did Not Help Much
(11:49) This Was Not Even Intentional In the Central Sense
---
First published:
Source:
Narrated by TYPE III AUDIO.
By LessWrong
