
AI reasoning models don’t just give answers — they plan, deliberate, and sometimes try to cheat.
In this episode of The Neuron, we’re joined by Bowen Baker, Research Scientist at OpenAI, to explore whether we can monitor AI reasoning before things go wrong — and why that transparency may not last forever.
Bowen walks us through real examples of AI reward hacking, explains why monitoring chain-of-thought is often more effective than checking outputs, and introduces the idea of a “monitorability tax” — trading raw performance for safety and transparency.
We also cover:
Why smaller models thinking longer can be safer than bigger models
How AI systems learn to hide misbehavior
Why suppressing “bad thoughts” can backfire
The limits of chain-of-thought monitoring
Bowen’s personal view on open-source AI and safety risks
If you care about how AI actually works — and what could go wrong — this conversation is essential.
Resources:
Evaluating chain-of-thought monitorability | OpenAI: https://openai.com/index/evaluating-chain-of-thought-monitorability/
Understanding neural networks through sparse circuits | OpenAI: https://openai.com/index/understanding-neural-networks-through-sparse-circuits/
OpenAI's alignment blog: https://alignment.openai.com/
👉 Subscribe for more interviews with the people building AI
👉 Join the newsletter at https://theneuron.ai
By The Neuron
