
In this episode, I talk with David Lindner about Myopic Optimization with Non-myopic Approval, or MONA, which attempts to address (multi-step) reward hacking by myopically optimizing actions against a human's sense of whether those actions are generally good. Does this work? Can we get smarter-than-human AI this way? How does this compare to approaches like conservatism? Listen to find out.
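As a rough illustration of the core idea (a minimal sketch in PyTorch, not the paper's implementation; `policy` and `approval_model` are hypothetical stand-ins), each action is reinforced only by a foresighted approval score for that step, so no credit flows back from later outcomes:

```python
import torch

def mona_policy_gradient_step(policy, approval_model, states, actions, optimizer):
    """One myopic policy-gradient update in the spirit of MONA.

    Assumptions: `policy(states)` returns a torch distribution over
    actions, and `approval_model(states, actions)` scores how good each
    action looks in general (non-myopic approval), without seeing the
    downstream outcome. Crucially, each action is reinforced only by
    its own approval score, never by later environment reward, so the
    agent gains nothing from setting up a multi-step reward hack.
    """
    log_probs = policy(states).log_prob(actions)     # log pi(a_t | s_t)
    with torch.no_grad():
        approvals = approval_model(states, actions)  # foresighted per-step score
    # Myopic objective: REINFORCE against immediate approval only
    # (effectively discount gamma = 0, no value bootstrapping).
    loss = -(log_probs * approvals).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```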
Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
Transcript: https://axrp.net/episode/2025/06/15/episode-43-david-lindner-mona.html
Topics we discuss, and timestamps:
0:00:29 What MONA is
0:06:33 How MONA deals with reward hacking
0:23:15 Failure cases for MONA
0:36:25 MONA's capability
0:55:40 MONA vs other approaches
1:05:03 Follow-up work
1:10:17 Other MONA test cases
1:33:47 When increasing time horizon doesn't increase capability
1:39:04 Following David's research
Links for David:
Website: https://www.davidlindner.me
Twitter / X: https://x.com/davlindner
DeepMind Medium: https://deepmindsafetyresearch.medium.com
David on the Alignment Forum: https://www.alignmentforum.org/users/david-lindner
Research we discuss:
MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking: https://arxiv.org/abs/2501.13011
Arguments Against Myopic Training: https://www.alignmentforum.org/posts/GqxuDtZvfgL2bEQ5v/arguments-against-myopic-training
Episode art by Hamish Doodles: hamishdoodles.com