In this special Christmas episode, we delve into "Best-of-N Jailbreaking," a powerful new black-box algorithm that demonstrates the vulnerabilities of cutting-edge AI systems. This approach works by repeatedly sampling augmented versions of a prompt (for example, with shuffled or capitalized text) until a harmful response is elicited.
Discover how Best-of-N (BoN) Jailbreaking achieves:
An attack success rate (ASR) of 89% on GPT-4o and 78% on Claude 3.5 Sonnet when sampling 10,000 augmented prompts.
Success in bypassing advanced defenses on both closed-source and open-source models.
Cross-modality attacks on vision, audio, and multimodal AI systems like GPT-4o and Gemini 1.5 Pro.
We’ll also explore how BoN Jailbreaking scales with the number of prompt samples, following a power-law relationship, and how combining BoN with other techniques amplifies its effectiveness. This episode unpacks the implications of these findings for AI security and resilience.
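For listeners who want a feel for what a power-law relationship between ASR and the number of samples N can look like, here is a minimal sketch. It assumes the functional form -log(ASR) = a * N^(-b), which is our reading of the scaling behavior and should be checked against the paper. The function names, the constants TRUE_A and TRUE_B, and all data points are hypothetical and synthetic; none of them are results from the study.

```python
import math

def fit_power_law(ns, asrs):
    """Fit -log(ASR) = a * N**(-b) by ordinary least squares in log-log space.

    Assumed functional form (see the note above); every ASR must lie
    strictly between 0 and 1 so that both logarithms are defined.
    """
    xs = [math.log(n) for n in ns]
    ys = [math.log(-math.log(p)) for p in asrs]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = (
        sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
        / sum((x - x_mean) ** 2 for x in xs)
    )
    intercept = y_mean - slope * x_mean
    # log(-log ASR) = log(a) - b * log(N), so a = exp(intercept), b = -slope.
    return math.exp(intercept), -slope

def predict_asr(n, a, b):
    """ASR predicted by the fitted curve at n samples."""
    return math.exp(-a * n ** (-b))

# Synthetic illustration only: these values are generated from made-up
# constants, not taken from the paper.
TRUE_A, TRUE_B = 3.0, 0.3
sample_ns = [1, 10, 100, 1000]
sample_asrs = [predict_asr(n, TRUE_A, TRUE_B) for n in sample_ns]

a_hat, b_hat = fit_power_law(sample_ns, sample_asrs)
forecast = predict_asr(10_000, a_hat, b_hat)
print(f"a={a_hat:.3f}, b={b_hat:.3f}, forecast ASR at N=10,000: {forecast:.3f}")
```

The practical appeal of such a fit is that a curve estimated from a small number of samples can be extrapolated to forecast ASR at much larger N, which is useful when estimating how robust a defense is likely to be under a larger sampling budget.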
Paper: Hughes, John, et al. "Best-of-N Jailbreaking." (2024). arXiv.
Disclaimer: This podcast summary was generated using Google's NotebookLM AI. While the summary aims to provide an overview, it is recommended to refer to the original research preprint for a comprehensive understanding of the study and its findings.