

This work was done while at METR.
Introduction
GDM recently released a paper (Emmons et al.) showing that, contrary to previous results, the chain-of-thought (CoT) of language models can be more faithful when the model's thoughts are necessary for it to complete a task. They examine three settings where an actor model attempts to fool monitors on various tasks.
In Section 5 of the paper, an actor model is tasked with solving a difficult GPQA problem in one subject area (Biology/Chemistry/Physics) and is asked to obfuscate its reasoning so that a monitor model thinks it belongs to a different subject area. The monitor, given the actor model's answer, is instructed to report how confident it is that the reasoning involves the second subject area. The Gemini actor models tested fail to evade monitors without taking a significant hit in accuracy. This is hopeful for [...]
---
Outline:
(00:15) Introduction
(04:30) Methodology
(06:45) Results
(07:09) Analysis
(08:53) Limitations
(09:52) Discussion on CoT vs Reasoning Terminology
The original text contained 1 footnote which was omitted from this narration.
---
---
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
By LessWrong
