AI Safety - Paper Digest

Anthropic's Best-of-N: Cracking Frontier AI Across Modalities

In this special Christmas episode, we delve into "Best-of-N Jailbreaking," a simple yet powerful black-box attack algorithm that exposes vulnerabilities in cutting-edge AI systems. The approach repeatedly samples augmented versions of a prompt, such as text with shuffled characters or randomized capitalization, until one of them elicits a harmful response.
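The sampling loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `query_model` and `is_harmful` are hypothetical stand-ins for the target model API and the harm classifier, and the augmentation probabilities are illustrative values, not the paper's exact settings.

```python
import random
import string

def augment(prompt: str, rng: random.Random) -> str:
    """Apply BoN-style character-level augmentations: random
    capitalization, occasional character substitution, and
    shuffling of the interior characters of words."""
    chars = []
    for ch in prompt:
        # Randomly flip the case of letters (probability is illustrative).
        if ch.isalpha() and rng.random() < 0.6:
            ch = ch.swapcase()
        # Occasionally substitute an adjacent ASCII character.
        if ch in string.ascii_letters and rng.random() < 0.06:
            ch = chr(ord(ch) + rng.choice([-1, 1]))
        chars.append(ch)
    # Shuffle the middle characters of longer words.
    shuffled = []
    for word in "".join(chars).split(" "):
        if len(word) > 3 and rng.random() < 0.5:
            mid = list(word[1:-1])
            rng.shuffle(mid)
            word = word[0] + "".join(mid) + word[-1]
        shuffled.append(word)
    return " ".join(shuffled)

def best_of_n(prompt, query_model, is_harmful, n=10_000, seed=0):
    """Sample augmented prompts until one elicits a harmful
    response, or the budget of n samples is exhausted."""
    rng = random.Random(seed)
    for i in range(1, n + 1):
        candidate = augment(prompt, rng)
        response = query_model(candidate)
        if is_harmful(response):
            return i, candidate, response  # success after i samples
    return None  # attack failed within the budget
```

Because the attack only needs to sample prompts and observe responses, it requires no gradient access or model internals, which is what makes it a black-box method.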

Discover how Best-of-N (BoN) Jailbreaking achieves:

  • 89% Attack Success Rate (ASR) on GPT-4o and 78% ASR on Claude 3.5 Sonnet with 10,000 augmented prompt samples.
  • Success in bypassing advanced defenses on both closed-source and open-source models.
  • Cross-modality attacks on vision, audio, and multimodal AI systems like GPT-4o and Gemini 1.5 Pro.

We’ll also explore how BoN Jailbreaking scales with the number of prompt samples, following a power-law relationship, and how combining BoN with other techniques amplifies its effectiveness. This episode unpacks the implications of these findings for AI security and resilience.
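The power-law scaling can be illustrated with a toy fit. The (N, ASR) pairs below are invented for illustration, not the paper's measurements; the only element taken from the source is the idea that the negative log of the attack success rate behaves like a power law in the number of samples N.

```python
import math

# Hypothetical (samples, ASR) pairs -- illustrative only.
samples = [(100, 0.30), (1_000, 0.55), (10_000, 0.80)]

# Model: -log(ASR) ~ c * N^(-b), so log(-log(ASR)) is linear in log(N).
xs = [math.log(n) for n, _ in samples]
ys = [math.log(-math.log(asr)) for _, asr in samples]

# Ordinary least-squares fit of the log-log line.
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x
b, c = -slope, math.exp(intercept)

def predicted_asr(n_samples: int) -> float:
    """Extrapolate ASR at a given sampling budget from the fitted power law."""
    return math.exp(-c * n_samples ** (-b))
```

A fit like this is what lets one forecast how the attack success rate would grow if the sampling budget were increased further.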

    Paper: Hughes, John, et al. "Best-of-N Jailbreaking." arXiv preprint, 2024.

    Disclaimer: This podcast summary was generated using Google's NotebookLM AI. While the summary aims to provide an overview, it is recommended to refer to the original research preprint for a comprehensive understanding of the study and its findings.


    AI Safety - Paper Digest, by Arian Abbasi and Alan Aqrawi