AI Safety - Paper Digest

Anthropic's Best-of-N: Cracking Frontier AI Across Modalities

In this special Christmas episode, we delve into "Best-of-N Jailbreaking," a simple yet powerful black-box attack algorithm that exposes vulnerabilities in cutting-edge AI systems. The approach repeatedly samples augmented versions of a prompt, such as text with shuffled characters or randomized capitalization, until one of them elicits a harmful response.
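The sampling loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `query_model` and `is_harmful` are hypothetical stand-ins for the target model API and the harm classifier, and the augmentation probabilities are illustrative values, not the paper's exact settings.

```python
import random
import string

def augment(prompt: str, rng: random.Random) -> str:
    """Apply BoN-style character-level augmentations: random
    capitalization, occasional character substitution, and
    shuffling of the interior characters of words."""
    chars = []
    for ch in prompt:
        # Randomly flip the case of letters (probability is illustrative).
        if ch.isalpha() and rng.random() < 0.6:
            ch = ch.swapcase()
        # Occasionally substitute an adjacent ASCII character.
        if ch in string.ascii_letters and rng.random() < 0.06:
            ch = chr(ord(ch) + rng.choice([-1, 1]))
        chars.append(ch)
    # Shuffle the middle characters of longer words.
    shuffled = []
    for word in "".join(chars).split(" "):
        if len(word) > 3 and rng.random() < 0.5:
            mid = list(word[1:-1])
            rng.shuffle(mid)
            word = word[0] + "".join(mid) + word[-1]
        shuffled.append(word)
    return " ".join(shuffled)

def best_of_n(prompt, query_model, is_harmful, n=10_000, seed=0):
    """Sample augmented prompts until one elicits a harmful
    response, or the budget of n samples is exhausted."""
    rng = random.Random(seed)
    for i in range(1, n + 1):
        candidate = augment(prompt, rng)
        response = query_model(candidate)
        if is_harmful(response):
            return i, candidate, response  # success after i samples
    return None  # attack failed within the budget
```

Because the attack only needs to sample prompts and observe responses, it requires no gradient access or model internals, which is what makes it a black-box method.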

Discover how Best-of-N (BoN) Jailbreaking achieves:

  • 89% Attack Success Rate (ASR) on GPT-4o and 78% ASR on Claude 3.5 Sonnet with 10,000 augmented prompt samples.
  • Success in bypassing advanced defenses on both closed-source and open-source models.
  • Cross-modality attacks on vision, audio, and multimodal AI systems like GPT-4o and Gemini 1.5 Pro.

We’ll also explore how BoN Jailbreaking scales with the number of prompt samples, following a power-law relationship, and how combining BoN with other techniques amplifies its effectiveness. This episode unpacks the implications of these findings for AI security and resilience.
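The power-law scaling can be illustrated with a toy fit. The (N, ASR) pairs below are invented for illustration, not the paper's measurements; the only element taken from the source is the idea that the negative log of the attack success rate behaves like a power law in the number of samples N.

```python
import math

# Hypothetical (samples, ASR) pairs -- illustrative only.
samples = [(100, 0.30), (1_000, 0.55), (10_000, 0.80)]

# Model: -log(ASR) ~ c * N^(-b), so log(-log(ASR)) is linear in log(N).
xs = [math.log(n) for n, _ in samples]
ys = [math.log(-math.log(asr)) for _, asr in samples]

# Ordinary least-squares fit of the log-log line.
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x
b, c = -slope, math.exp(intercept)

def predicted_asr(n_samples: int) -> float:
    """Extrapolate ASR at a given sampling budget from the fitted power law."""
    return math.exp(-c * n_samples ** (-b))
```

A fit like this is what lets one forecast how the attack success rate would grow if the sampling budget were increased further.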

    Paper: Hughes, John, et al. "Best-of-N Jailbreaking." arXiv preprint, 2024.

    Disclaimer: This podcast summary was generated using Google's NotebookLM AI. While the summary aims to provide an overview, it is recommended to refer to the original research preprint for a comprehensive understanding of the study and its findings.


    AI Safety - Paper Digest, by Arian Abbasi and Alan Aqrawi