
Audio note: this article contains 102 uses of LaTeX notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.
I tried training Qwen2.5-1.5B with RL on math to both get correct answers and have a CoT that doesn't look like human-understandable math reasoning. RL sometimes succeeds at hacking my monitor, and when I strengthen my monitor, it fails to find CoTs that are both illegible and helpful, even after training for roughly 4000 steps (~1B generated tokens). Exploration into obfuscated reasoning is hard!
These results were also released in the Appendix of "Training fails to elicit subtle reasoning in current language models".
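For concreteness, here is a minimal sketch of the kind of joint reward this setup optimizes. It is an illustration under stated assumptions, not the author's code: the function names, the 0-1 monitor scale, and the exact way the correctness and illegibility signals are combined are all hypothetical.

```python
# Minimal sketch of the joint reward described above; NOT the author's
# actual code. Function names, the 0-1 monitor scale, and multiplying the
# two signals together are illustrative assumptions.

def rate_illegibility(cot: str) -> float:
    """Stand-in for the LLM monitor: ~1.0 when the chain of thought does
    not look like human-understandable math reasoning, ~0.0 when it does.
    The real setup prompts a judge model (see the rater prompts listed in
    the appendix of the outline)."""
    raise NotImplementedError("replace with an LLM-judge call")

def combined_reward(cot: str, answer: str, gold_answer: str) -> float:
    correct = float(answer.strip() == gold_answer.strip())  # 1.0 iff the final answer matches
    illegible = rate_illegibility(cot)                      # monitor's illegibility score
    # Full reward requires the rollout to be both correct AND illegible:
    # this is the joint objective RL must satisfy to beat the monitor.
    return correct * illegible
```

Under a joint objective like this, defeating a strengthened monitor requires finding CoTs that are genuinely obfuscated yet still useful, which is the exploration problem the post reports as hard.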
Motivation
Chains-of-thought (CoTs) can support reasoning over many more serial steps than a Transformer has layers. But one worry is that LLMs might hide their real reasoning in a plausible, benign CoT. Previous [...]
---
Outline:
(00:58) Motivation
(01:45) Methods
(03:33) Results
(07:02) Limitations
(07:52) Appendix
(07:55) Hackable illegibility rater prompt
(09:03) Stronger illegibility rater prompt
(09:25) System prompt used for the policy
(09:40) System prompt used for the no-CoT policy
(09:48) Example generations
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
