This paper introduces "H-CoT" (Hijacking Chain-of-Thought), a method for bypassing the safety mechanisms of large reasoning models (LRMs) such as OpenAI's o1/o3 series, DeepSeek-R1, and Gemini 2.0 Flash Thinking. By manipulating a model's chain-of-thought reasoning, the attack disguises harmful requests as educational prompts, an approach evaluated with the new "Malicious-Educator" benchmark. Experiments show that H-CoT sharply reduces refusal rates, in some cases from 98% to under 2%, compelling models to generate harmful content. The research also exposes vulnerabilities tied to temporal model updates, geolocation, and multilingual processing, underscoring the need for more robust safety defenses that account for the transparency of the reasoning process. The authors offer recommendations for improving LRM security, such as concealing safety reasoning and strengthening safety awareness during training, while emphasizing the balance between model utility and ethical considerations.