


By Redwood Research

Subtitle: And diagrams describing how threat scenarios involving misaligned AI compromise the system in different places.
A lot of threat models describing how AIs might escape our control (e.g. self-exfiltration, hacking the datacenter) start out with AIs acting as agents working autonomously on research tasks (especially AI R&D) in a datacenter controlled by the AI company. So I think it's important to have a clear picture of how this kind of AI agent could work, and how it might be secured. I often talk to people who seem to have a somewhat confused picture of how this kind of agent setup would work, which causes them to conflate different versions of the threat model and to miss some important points about which aspects of the system are easy or hard to defend.
So in this post, I’ll present a simple system architecture for a [...]
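The excerpt above cuts off before the architecture itself. As a rough illustration of the kind of setup the post analyzes (an agent scaffold that relays model-proposed shell commands to a sandboxed execution machine, with the model weights held on a separate inference server), here is a minimal sketch in Python. Everything in it, including the query_model stub and the command runner, is hypothetical scaffolding for illustration rather than the post's actual design.

```python
# Hypothetical sketch of a minimal autonomous-agent loop: a scaffold asks
# an inference server (which holds the model weights) for the next shell
# command, runs it on a sandboxed execution machine, and feeds the output
# back. All names here are illustrative assumptions, not from the post.
import subprocess


def query_model(transcript: str) -> str:
    """Stub for a call to the inference server holding the model weights.
    A real deployment would send the transcript over the network and get
    back the model's proposed next action."""
    return "echo 'hello from the agent'"  # placeholder action


def run_in_sandbox(command: str, timeout: int = 60) -> str:
    """Run the model-proposed command on the execution machine and capture
    its output. The sandbox boundary is what is supposed to keep a
    misaligned agent from, e.g., reaching the weights directly."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return result.stdout + result.stderr


def agent_loop(task: str, max_steps: int = 3) -> str:
    """Alternate between asking the model for a command and executing it,
    appending each observation to the running transcript."""
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        command = query_model(transcript)
        observation = run_in_sandbox(command)
        transcript += f"$ {command}\n{observation}"
    return transcript


if __name__ == "__main__":
    print(agent_loop("run the experiment and summarize the results"))
```

A setup like this makes the post's framing concrete: each boundary (scaffold to inference server, scaffold to sandbox) is a distinct place where a misaligned agent would have to compromise the system.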
---
Outline:
(03:13) The simplest design
(04:50) Process flow
(08:01) Some things to notice about this design
(09:30) An extended design
(11:29) Different ways the AI could compromise this
(11:34) Model weight exfiltration
(14:05) Rogue internal deployments
(15:21) Using this framework to discuss alleged examples of AIs modifying their own environments
(15:28) Sakana AI
(16:00) o1 agent architecture
(16:36) Sycophancy to subterfuge
---
Narrated by TYPE III AUDIO.