The Glitchatorio

The Scratchpad Monologues (CoT part 2)

If chain of thought is a model "thinking aloud" to itself, then why does it express doubt, frustration, or suspicion about the problems it's solving, sometimes for pages and pages of its scratchpad?

And what does chain of thought mean for AI safety?

We'll hear from Julian Schulz, a researcher who's studying encoded reasoning in large language models, about where the opportunities, risks and weirdness lie in chain of thought. Here are some links to his research:

  • On a model jailbreaking its monitor: https://www.lesswrong.com/posts/szyZi5d4febZZSiq3/monitor-jailbreaking-evading-chain-of-thought-monitoring
  • A roadmap for safety cases based on CoT: https://arxiv.org/html/2510.19476v1#S1
  • His posts on Less Wrong: https://www.lesswrong.com/users/wuschel-schulz

Some of the other papers we discussed include:

  • On the biology of a large language model: https://transformer-circuits.pub/2025/attribution-graphs/biology.html
  • Monitoring reasoning models for misbehavior and the risks of promoting obfuscation: https://arxiv.org/pdf/2503.11926
  • How steganography comes about: https://arxiv.org/pdf/2506.01926
  • Assuring agent safety evals by analysing transcripts (with excerpts from weird monologues): https://www.alignmentforum.org/posts/e8nMZewwonifENQYB/assuring-agent-safety-evaluations-by-analysing-transcripts
  • Stress-testing deliberative alignment for anti-scheming training: https://www.apolloresearch.ai/research/stress-testing-deliberative-alignment-for-anti-scheming-training/
  • And the "watchers" CoT snippet from the paper above:  https://www.antischeming.ai/snippets#using-non-standard-language


By Witch of Glitch