The Nonlinear Library: Alignment Forum

AF - Theories of Change for AI Auditing by Lee Sharkey


Listen Later

Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Theories of Change for AI Auditing, published by Lee Sharkey on November 13, 2023 on The AI Alignment Forum.
Executive summary
Our mission at Apollo Research is to reduce catastrophic risks from AI by auditing advanced AI systems for misalignment and dangerous capabilities, with an initial focus on deceptive alignment.
In our announcement post, we presented a brief theory of change of our organization which explains why we expect AI auditing to be strongly positive for reducing catastrophic risk from advanced AI systems.
In this post, we present a theory of change for how AI auditing could improve the safety of advanced AI systems. We describe what AI auditing organizations would do; why we expect this to be an important pathway to reducing catastrophic risk; and explore the limitations and potential failure modes of such auditing approaches.
We want to emphasize that this is our current perspective and, given that the field is still young, could change in the future.
As presented in 'A Causal Framework for AI Regulation and Auditing', one of the ways to think about auditing is that auditors act at different steps of the causal chain that leads to AI systems' effects on the world. This chain can be broken down into different components (see figure in main text), and we describe auditors' potential roles at each stage. Having defined these roles, we identify and outline five categories of audits and their theories of change:
AI system evaluations assess the capabilities and alignment of AI systems through behavioral tests and interpretability methods. They can directly identify risks, improve safety research by converting alignment from a "one-shot" problem to a "many-shot problem" and provide evidence to motivate governance.
Training design audits assess training data content, effective compute, and training-experiment design. They aim to reduce risks by shaping the AI system development process and privilege safety over capabilities in frontier AI development.
Deployment audits assess the risks from permitting particular categories of people (such as lab employees, external auditors, or the public) to use the AI systems in particular ways.
Security audits evaluate the security of organizations and AI systems to prevent accidents and misuse. They constrain AI system affordances and proliferation risks.
Governance audits evaluate institutions developing, regulating, auditing, and interacting with frontier AI systems. They help ensure responsible AI development and use.
In general, external auditors provide defence-in-depth (overlapping audits are more likely to catch more risks before they're realized); AI safety-expertise sharing; transparency of labs to regulators; public accountability of AI development; and policy guidance.
But audits have limitations which may include risks of false confidence or safety washing; overfitting to audits; and lack of safety guarantees from behavioral AI system evaluations.
The recommendations of auditors need to be backed by regulatory authority in order to ensure that they improve safety. It will be important for safety to build a robust AI auditing ecosystem and to research improved evaluation methods.
Introduction
Frontier AI labs are training and deploying AI systems that are increasingly capable of interacting intelligently with their environment. It is therefore ever more important to evaluate and manage risks resulting from these AI systems. One step to help reduce these risks is AI auditing, which aims to assess whether AI systems and the processes by which they are developed are safe.
At Apollo Research, we aim to serve as external AI auditors (as opposed to internal auditors situated within the labs building frontier AI). Here we discuss Apollo Research's theories of change, i.e. the pathways by which auditing hopefully imp...
...more
View all episodesView all episodes
Download on the App Store

The Nonlinear Library: Alignment ForumBy The Nonlinear Fund


More shows like The Nonlinear Library: Alignment Forum

View all
AXRP - the AI X-risk Research Podcast by Daniel Filan

AXRP - the AI X-risk Research Podcast

9 Listeners