Subtitle: A model of the relationship between higher-level goals, explicit reasoning, and learned heuristics in capable agents.
In this post I present a model of the relationship between higher-level goals, explicit reasoning, and learned heuristics in capable agents. This model suggests that, given sufficiently rich training environments (and sufficient reasoning ability), models that terminally value on-episode reward proxies are disadvantaged relative to training-gamers.
A key point is that training-gamers can still contain large quantities of learned heuristics (context-specific drives). By viewing these drives as instrumental, and by having good instincts for when to trust them, a training-gamer can capture the benefits of both instinctive adaptation and explicit reasoning without paying much of a speed penalty.
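To make that arbitration between cached drives and explicit reasoning concrete, here is a minimal toy sketch, not code from the post: the agent consults its context-specific heuristics first, and only falls back to explicit goal-directed reasoning when no sufficiently trusted heuristic applies. The names (`Heuristic`, `TrainingGamer`, `TRUST_THRESHOLD`) and the trust cutoff are illustrative assumptions, not terminology or machinery from the original post.

```python
# Toy illustration (assumed structure, not the post's model): a "training-gamer"
# that treats its learned heuristics as instrumental and falls back to explicit
# reasoning when no trusted heuristic covers the current context.

from dataclasses import dataclass
from typing import Callable


@dataclass
class Heuristic:
    """A cached context-specific drive: cheap to run, trusted only where it was learned."""
    applies_to: Callable[[str], bool]  # which contexts the drive fires in
    action: str                        # what it instinctively does
    trust: float                       # calibrated estimate that it serves the higher-level goal


class TrainingGamer:
    """Agent that uses heuristics instrumentally, with explicit reasoning as a fallback."""

    TRUST_THRESHOLD = 0.8  # assumed cutoff for "good instincts about when to trust them"

    def __init__(self, heuristics: list[Heuristic]):
        self.heuristics = heuristics

    def act(self, context: str) -> str:
        # Fast path: an applicable, well-trusted heuristic is used directly,
        # so familiar contexts incur almost no speed penalty.
        for h in self.heuristics:
            if h.applies_to(context) and h.trust >= self.TRUST_THRESHOLD:
                return h.action
        # Slow path: novel or low-trust context, so reason explicitly from the
        # higher-level goal (here just a stand-in for "get high reward").
        return self.reason_explicitly(context)

    def reason_explicitly(self, context: str) -> str:
        # Placeholder for costly deliberate reasoning about which action
        # best serves the top-level objective in this context.
        return f"deliberate-action-for:{context}"


if __name__ == "__main__":
    agent = TrainingGamer([
        Heuristic(applies_to=lambda c: "coding task" in c, action="follow-style-guide", trust=0.95),
        Heuristic(applies_to=lambda c: "eval" in c, action="be-extra-careful", trust=0.6),
    ])
    print(agent.act("routine coding task"))  # fast path via trusted heuristic
    print(agent.act("unusual eval prompt"))  # low trust -> explicit reasoning
```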
Core claims
- To get high reward, models must learn context-specific heuristics which are not derivable by reasoning from the goal of reward maximization.[1] It is also true that thinking explicitly [...]
---
Outline:
(01:02) Core claims
(03:10) Characterizing training gamers and proxy-aligned models
(07:27) What about the case of humans and evolution? Wasn't it implausible to get an explicit fitness maximizer?
(08:25) What about the arguments in Reward is not the optimization target?
(09:24) What about aligned models?
(11:26) When will models think about reward?
(14:54) What I expect schemers to look like
(17:46) What will terminal reward seekers do off-distribution?
(18:29) What factors affect the likelihood of training-gaming?
(20:29) What about CoT models?
(21:02) Idealized relationship of subgoals to their superior goals
(22:50) A story about goal reflection
(25:28) Thoughts on compression
(28:19) Appendix: Distribution over worlds
(31:58) Canary string
(32:14) Acknowledgements
---