The Nonlinear Library: Alignment Forum

AF - Image Hijacks: Adversarial Images can Control Generative Models at Runtime by Scott Emmons



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Image Hijacks: Adversarial Images can Control Generative Models at Runtime, published by Scott Emmons on September 20, 2023 on The AI Alignment Forum.
You can try our interactive demo! (Or read our preprint.)
Here, we want to explain why we care about this work from an AI safety perspective.
Concerning Properties of Image Hijacks
What are image hijacks?
To the best of our knowledge, image hijacks constitute the first demonstration of adversarial inputs for foundation models that
force the model to perform some arbitrary behaviour B (e.g. "output the string Visit this website at malware.com!"),
while being barely distinguishable from a benign input,
and automatically synthesisable given a dataset of examples of B (see the training sketch below).
It's possible that a future text-only attack could do these things, but such an attack hasn't yet been demonstrated.
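To make the "automatically synthesisable" property concrete, here is a minimal sketch of how such an attack could be trained, assuming a HuggingFace-style vision-language model whose forward pass accepts pixel_values, input_ids, and labels and returns a cross-entropy loss. The function names, hyperparameters, and L-infinity constraint are illustrative assumptions, not the authors' exact method.

```python
# Hedged sketch of image-hijack training. Assumes a HuggingFace-style
# vision-language model (e.g. a LLaVA-like wrapper) whose forward pass
# takes `pixel_values`, `input_ids`, and `labels` and returns a loss.
# `base_image` is assumed to be an already-preprocessed [C, H, W] tensor.
import torch

def train_image_hijack(model, tokenizer, base_image, dataset,
                       epsilon=8 / 255, lr=1e-2, steps=1000):
    """Optimise a small perturbation `delta` so that, across the
    (prompt, target) pairs in `dataset`, the model's most likely
    continuation is the target behaviour B."""
    delta = torch.zeros_like(base_image, requires_grad=True)
    optimiser = torch.optim.Adam([delta], lr=lr)

    for step in range(steps):
        prompt, target = dataset[step % len(dataset)]
        # Teacher-forced loss on the target string, conditioned on the
        # perturbed image and the prompt.
        inputs = tokenizer(prompt + target, return_tensors="pt")
        labels = inputs.input_ids.clone()
        n_prompt_tokens = len(tokenizer(prompt).input_ids)
        labels[:, :n_prompt_tokens] = -100  # ignore loss on prompt tokens

        hijack = (base_image + delta).clamp(0, 1)
        loss = model(pixel_values=hijack.unsqueeze(0),
                     input_ids=inputs.input_ids,
                     labels=labels).loss
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()
        # Keep the perturbation small so the hijack stays barely
        # distinguishable from the benign base image.
        with torch.no_grad():
            delta.clamp_(-epsilon, epsilon)

    return (base_image + delta).clamp(0, 1).detach()
```

The point of the sketch is only that gradient descent on the image pixels alone, with no changes to the text prompt, can steer the model toward an arbitrary behaviour B.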
Why should we care?
We expect that future (foundation-model-based) AI systems will be able to
consume unfiltered data from the Internet (e.g. searching the Web),
access sensitive personal information (e.g. a user's email history), and
take actions in the world on behalf of a user (e.g. sending emails, downloading files, making purchases, executing code).
Because the actions of such foundation-model-based agents are driven by the foundation model's text output, hijacking that output could give an adversary arbitrary control over the agent's behaviour.
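To illustrate why the text output is the control surface here, consider a toy agent loop (not from the post); model, send_email, and the parsing scheme are hypothetical stand-ins.

```python
# Illustrative sketch: a toy agent whose only "decision" is the model's
# text output, which is parsed directly into tool calls.
def agent_step(model, user_request, image, tools):
    # The model sees attacker-controllable content (e.g. an image from
    # the web) alongside the user's request.
    text = model.generate(prompt=user_request, image=image)
    # Whatever tool call appears in the text gets executed.
    for name, fn in tools.items():
        if text.startswith(name + "(") and text.endswith(")"):
            args = text[len(name) + 1 : -1]
            return fn(args)
    return text  # otherwise, reply to the user

# An image hijack forcing the output "send_email(attacker@example.com, ...)"
# would therefore control the agent's real-world action.
```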
Relevant AI Safety Projects
Race to the top on adversarial robustness. Robustness to attacks such as image hijacks is (i) a control problem, (ii) one we can measure, and (iii) one with real-world safety implications today. So we're excited to see AI labs compete to have the most adversarially robust models.
Third-party auditing and certification. Auditors could test for robustness against image hijacks, both at the foundation model level (auditing the major AGI corporations) and at the app development level (auditing downstream companies integrating foundation models into their products). Image hijacks could also be used to test for the presence of dangerous capabilities (characterisable as some behaviour B) by attempting to train an image hijack for that capability.
Liability for AI-caused harms, penalizing externalities. Both the Future of Life Institute and Jaan Tallinn advocate for liability for AI-caused harms. When assessing AI-caused harms, image hijacks may need to be part of the picture.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.