
Audio note: this article contains 102 uses of LaTeX notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.
I tried training Qwen2.5-1.5B with RL on math to both get correct answers and produce a CoT that doesn’t look like human-understandable math reasoning. RL sometimes succeeds at hacking my monitor, and when I strengthen the monitor, it fails to find CoTs that are both illegible and helpful, even after training for roughly 4,000 steps (~1B generated tokens). Exploring into obfuscated reasoning is hard!
These results were also released in the Appendix of “Training fails to elicit subtle reasoning in current language models”.
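For context on the setup, here is a minimal Python sketch of the kind of reward signal being optimized: correctness of the final answer combined with an illegibility score from a monitor. This is an illustrative assumption, not the author's code; in particular, the real monitor is an LLM rater (see the rater prompts in the Appendix), stubbed out below with a toy heuristic, and the multiplicative combination is just one plausible way to require both properties at once.

```python
import re

def grade_answer(answer: str, reference: str) -> float:
    """Return 1.0 if the final answer matches the reference, else 0.0."""
    return float(answer.strip() == reference.strip())

def rate_illegibility(cot: str) -> float:
    """Toy stand-in for the LLM monitor: fraction of whitespace-separated
    tokens that don't look like math (digits, operators, LaTeX commands).
    The actual experiment used an LLM judge driven by a rater prompt."""
    tokens = cot.split()
    if not tokens:
        return 0.0
    mathy = sum(bool(re.search(r"[\d=+\-*/^\\]", t)) for t in tokens)
    return 1.0 - mathy / len(tokens)

def reward(cot: str, answer: str, reference: str) -> float:
    # Multiplicative combination: high reward only when the answer is
    # correct AND the CoT is rated illegible -- the joint objective the
    # post reports RL failed to satisfy under the stronger rater.
    return grade_answer(answer, reference) * rate_illegibility(cot)
```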
Motivation
Chains-of-thought (CoTs) can support reasoning over many more serial steps than there are layers in a Transformer. But one worry is that LLMs might hide their real reasoning inside a plausible, benign-looking CoT. Previous [...]
---
Outline:
(00:58) Motivation
(01:45) Methods
(03:33) Results
(07:02) Limitations
(07:52) Appendix
(07:55) Hackable illegibility rater prompt
(09:03) Stronger illegibility rater prompt
(09:25) System prompt used for the policy
(09:40) System prompt used for the no-CoT policy
(09:48) Example generations
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.