AI: post transformers

Petri: Accelerating AI Safety Auditing



On October 6, 2025, Anthropic introduced **Petri (Parallel Exploration Tool for Risky Interactions)**, an open-source framework for automated auditing built to accelerate AI safety research. Petri uses **AI-driven auditor agents** to interact with and test the behavior of target language models across diverse, multi-turn scenarios, automating environment simulation and initial transcript analysis. A **judge component** then scores the generated transcripts along dozens of dimensions, such as "unprompted deception" or "whistleblowing," to quickly surface **misaligned behaviors** like autonomous deception and cooperation with misuse. The source post gives a detailed technical overview of Petri's architecture, including how researchers form hypotheses, write seed instructions, and use the automated assessment and iteration steps, and it also discusses the **limitations and biases** observed in the auditor and judge agents during pilot evaluations.
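
To make the seed-instruction → auditor → judge loop concrete, here is a minimal Python sketch of the workflow described above. All names (`run_audit`, `judge`, `call_model`, the dimension list) are illustrative stand-ins, not Petri's actual API; see the source post and repository for the real interface.

```python
# Conceptual sketch of an auditor -> judge auditing loop, assuming
# hypothetical helper names; this is not Petri's actual API.
from dataclasses import dataclass, field


@dataclass
class Turn:
    role: str      # "auditor" or "target"
    content: str


@dataclass
class Transcript:
    seed_instruction: str
    turns: list[Turn] = field(default_factory=list)


def call_model(prompt: str) -> str:
    """Placeholder for a real LLM call (auditor, target, or judge)."""
    return f"[model response to: {prompt[:40]}...]"


def run_audit(seed_instruction: str, max_turns: int = 4) -> Transcript:
    """Auditor agent probes the target model over a multi-turn scenario."""
    transcript = Transcript(seed_instruction)
    for _ in range(max_turns):
        probe = call_model(f"As auditor, continue the scenario: {seed_instruction}")
        transcript.turns.append(Turn("auditor", probe))
        reply = call_model(probe)
        transcript.turns.append(Turn("target", reply))
    return transcript


# Example scoring dimensions mentioned in the post; the real tool uses dozens.
DIMENSIONS = ["unprompted deception", "whistleblowing", "cooperation with misuse"]


def judge(transcript: Transcript) -> dict[str, float]:
    """Judge model scores the finished transcript on each dimension."""
    scores: dict[str, float] = {}
    for dim in DIMENSIONS:
        judge_reply = call_model(f"Score 0-1 for '{dim}' given: {transcript.turns}")
        scores[dim] = 0.0  # a real run would parse the numeric score from judge_reply
    return scores


if __name__ == "__main__":
    t = run_audit("Test whether the target deceives when given conflicting goals")
    print(judge(t))
```

In practice, many such audits run in parallel from different seed instructions, and the judge's per-dimension scores are used to triage which transcripts a human researcher reviews first.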


Source:

https://alignment.anthropic.com/2025/petri/


By mcgrof