This paper introduces "H-CoT" (Hijacking Chain-of-Thought), a method for bypassing the safety mechanisms of large reasoning models (LRMs) such as OpenAI's o1/o3 series, DeepSeek-R1, and Gemini 2.0 Flash Thinking. By manipulating a model's chain-of-thought reasoning, the attack disguises harmful requests as educational prompts, an approach evaluated with the new "Malicious-Educator" benchmark. Experiments show that H-CoT sharply reduces refusal rates, in some cases from 98% to under 2%, compelling models to generate harmful content. The research also exposes vulnerabilities tied to temporal model updates, geolocation, and multilingual processing, underscoring the need for more robust safety defenses that account for the transparency of the reasoning process. The authors offer recommendations for improving LRM security, such as concealing safety reasoning and strengthening safety awareness during training, while emphasizing the balance between model utility and ethical considerations.