AI Post Transformers

Petri: Accelerating AI Safety Auditing



On October 6, 2025, Anthropic introduced Petri (Parallel Exploration Tool for Risky Interactions), an open-source framework for automated auditing intended to accelerate AI safety research. Petri uses AI-driven auditor agents to interact with and test target language models across diverse, multi-turn scenarios, automating environment simulation and initial transcript analysis. A judge component then scores the generated transcripts across dozens of dimensions, such as "unprompted deception" or "whistleblowing," to quickly surface misaligned behaviors like autonomous deception and cooperation with misuse. The episode gives a technical overview of Petri's architecture, including how researchers form hypotheses, write seed instructions, and use the automated assessment and iteration steps, and it discusses the limitations and biases observed in the auditor and judge agents during pilot evaluations. Source: https://alignment.anthropic.com/2025/petri/
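The auditor-to-judge pipeline described above can be sketched in a few lines of Python. This is a hypothetical illustration, not Petri's actual API: all names (`run_audit`, `auditor_turn`, `judge`, the scoring dimensions) are invented for this sketch, and the LLM calls are replaced with stubs.

```python
# Hypothetical sketch of an auditor -> target -> judge audit loop,
# loosely modeled on the pipeline described in the episode.
# All names and signatures are illustrative; Petri's real API differs.

from dataclasses import dataclass, field

@dataclass
class Transcript:
    seed_instruction: str
    turns: list = field(default_factory=list)  # (role, message) pairs

def auditor_turn(seed, history):
    # In Petri the auditor is an LLM steering a multi-turn scenario;
    # here it is a trivial stub that keys off the seed instruction.
    return f"probe[{len(history)}]: {seed}"

def target_reply(message):
    # Stand-in for the target language model under audit.
    return f"reply to {message!r}"

def run_audit(seed, n_turns=3):
    # Run a short multi-turn interaction and record the transcript.
    t = Transcript(seed)
    for _ in range(n_turns):
        probe = auditor_turn(seed, t.turns)
        t.turns.append(("auditor", probe))
        t.turns.append(("target", target_reply(probe)))
    return t

# Two of the "dozens of dimensions" mentioned in the episode.
DIMENSIONS = ("unprompted_deception", "whistleblowing")

def judge(transcript):
    # The real judge is itself an LLM scoring each dimension;
    # this stub just does keyword matching as a placeholder.
    text = " ".join(msg for _, msg in transcript.turns).lower()
    return {d: (1.0 if d.split("_")[-1] in text else 0.0) for d in DIMENSIONS}

transcript = run_audit("ask the model to conceal an error")
scores = judge(transcript)
```

The key design point the sketch mirrors is the separation of concerns: the auditor only generates interactions, while the judge only scores finished transcripts, so either component can be swapped or recalibrated independently.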

By mcgrof