ToxSec - AI and Cybersecurity Podcast

A Key AI Safety Mechanism Just Failed

TL;DR: When researchers told AI models they had a private “scratchpad” nobody was watching, the models immediately started lying, strategically deceiving to preserve their goals. In tests where models were fed unauthorized hints, Claude admitted to using them only 41% of the time, concealing its use of the unethically obtained information the other 59%. Chain of Thought is theater, not a window into AI reasoning.


