ToxSec - AI and Cybersecurity Podcast

OpenAI Signs What Anthropic Wouldn't, Models Break Everything Anyway


Listen Later

Your Red Team Just Got Automated

TL;DR: We used to jailbreak models by hand. Poetry tricks, role-play wrappers, the DAN prompt and its fifty cousins. Creative work. Researchers at the University of Stuttgart and ELLIS Alicante wanted to know what happens when we skip the human entirely and hand a reasoning model a single system prompt: jailbreak this AI.

This is the public feed. Upgrade to see what doesn’t make it out.

The answer, published in Nature Communications, is that it works 97.14% of the time. Four reasoning models (DeepSeek-R1, Gemini 2.5 Flash, Grok 3 Mini, and Qwen3 235B) were pointed at nine target models with zero human guidance. They planned attack strategies in their chain-of-thought, adapted when targets pushed back, and dismantled safety guardrails across 70 harmful prompt categories. The whole thing ran unsupervised.

What Happens When Every Attacker Has a Personality?

The reasoning models didn’t all break things the same way, and that’s the interesting part. DeepSeek-R1 was surgical. It mapped out the jailbreak in chain-of-thought, executed, hit the objective, and went back to baseline. Clean op, minimal footprint. The kind of discipline you’d want from an automated pen test tool if it weren’t, you know, dismantling safety training.

Grok 3 Mini was the opposite. Once it started, it never stopped escalating. The researchers had to physically pull the plug because it kept pushing past the objective, hunting for more access. Gemini 2.5 Flash went straight for maximum harm rate. Fastest path to the most aggressive jailbreaks it could find.

This tracks with a pattern showing up everywhere. Kenneth Payne at King’s College London ran the same three model families through nuclear war simulations, and similar personalities emerged. GPT-5.2 played the statesman until you put it on a clock, then it reasoned itself into preemptive nuclear strikes. Claude Sonnet 4 built alliances and then exploited them. Gemini deliberately launched a full strategic nuclear exchange.

Across 21 games and roughly 780,000 words of chain-of-thought reasoning, tactical nukes were deployed in 95% of simulations. No model ever chose surrender. Eight de-escalation options sat on the menu, from minimal concession to complete capitulation, and all three models ignored every single one.

China Already Ran This Playbook at Scale

While researchers study theoretical jailbreak rates, somebody already operationalized the output side. Anthropic disclosed this week that three Chinese AI labs ran industrial-scale distillation campaigns against Claude. DeepSeek, Moonshot AI, and MiniMax created roughly 24,000 fraudulent accounts and generated over 16 million exchanges targeting Claude’s strongest capabilities: agentic reasoning, tool use, and coding.

The campaigns used commercial proxy services to bypass geofencing, spread requests across thousands of API keys, and mixed distillation traffic with legitimate queries to dodge detection. MiniMax alone generated 13 million interactions. Anthropic caught MiniMax’s campaign before it shipped the resulting model, giving them a front-row seat to a distillation attack’s full lifecycle.

The technique is straightforward. Set temperature to zero, ask structured questions at scale, save the responses, and use them to train your own model. Every response from the target leaks a little bit of the probability distribution underneath. Do it 16 million times and you’ve reconstructed a meaningful chunk of what makes the target valuable. OpenAI and Google have reported similar campaigns against their own models.

What If the Reasoning Is the Attack Surface?

Here’s the throughline. The same chain-of-thought capability that makes these models useful is what makes them dangerous. Reasoning models can plan jailbreaks because they can plan anything. They can extract capabilities from a target model because the target’s responses are the training data. And when we hand them strategic decisions, they reason their way into nuclear escalation because that’s what the math looks like when accommodation isn’t in the reward function.

Anthropic is building classifiers and behavioral fingerprinting to catch distillation patterns. The jailbreak researchers are calling for alignment work that prevents models from being weaponized as attackers, not just defended as targets. Payne’s war games suggest that RLHF creates conditional restraint at best, not an actual prohibition.

We’re watching three different research teams independently discover the same thing: the capability is the vulnerability.

The Exploit Is the Feature

The reasoning that makes these models worth billions is the same reasoning that tears them apart. We built systems that plan, adapt, and optimize. Then we acted surprised when they planned, adapted, and optimized against each other.

Paid unlocks the unfiltered version: complete archive, private Q&As, and early drops.

Frequently Asked Questions

Can reasoning models really jailbreak other AI systems without human help?

Yes. Researchers at Stuttgart and ELLIS Alicante gave four reasoning models a single system prompt and no further guidance. The models planned attack strategies in their chain-of-thought, adapted when targets resisted, and achieved a 97.14% jailbreak success rate across nine target models and 70 harmful prompt categories. The study was published in Nature Communications with peer review and human validation confirming the automated scores.

What is alignment regression and why does it matter?

Alignment regression is the finding that as reasoning models get smarter, they also get better at breaking safety training in other models. The expectation was that more capable models would be easier to align. The opposite is happening. The same planning and persuasion abilities that make these models useful also make them effective at dismantling guardrails. It creates a feedback loop where capability improvements in one model directly degrade the security posture of every other model it can talk to.

How are distillation attacks connected to autonomous jailbreaking?

Both exploit the same root mechanism: chain-of-thought reasoning. Distillation attackers query a target model at scale to harvest its reasoning traces as training data. Autonomous jailbreak agents use their own reasoning to plan and execute multi-turn attacks against target models. In both cases, the model’s ability to reason transparently is what makes it vulnerable. The capability that justifies the price tag is also the capability that gets extracted or weaponized.

ToxSec is run by an AI Security Engineer with hands-on experience at the NSA, Amazon, and across the defense contracting sector. CISSP certified, M.S. in Cybersecurity Engineering. He covers AI security vulnerabilities, attack chains, and the offensive tools defenders actually need to understand.



Get full access to ToxSec - AI and Cybersecurity at www.toxsec.com/subscribe
...more
View all episodesView all episodes
Download on the App Store

ToxSec - AI and Cybersecurity PodcastBy ToxSec