The Adversarial Testing Podcast

Viral Prompt Shows ChatGPT's Content Filters Don't Work



Mindgard red-team research shows that ChatGPT's image generator can be manipulated into producing violent and sexually explicit content that users never directly requested, by abusing a fun viral "restore the photo" prompt circulating on social media. This episode walks through how nondescript prompts slip past input and output filters, why prompt repetition makes things worse, and OpenAI's inadequate response. Content warning: discussion of death, sexual violence and murder. The exact jailbreak prompts are deliberately not included.

By Damian