This is a link post. Using representation engineering, we systematically induce, detect, and control strategic deception in CoT-enabled LLMs, extracting "deception vectors" via Linear Artificial Tomography (LAT) for 89% detection accuracy. Through activation steering, we achieve a 40% success rate in eliciting context-appropriate deception without explicit prompts, revealing honesty-related failure modes specific to reasoning models and providing tools for trustworthy AI alignment.
This seems like a positive breakthrough for mech interp research generally: the team used RepE to identify features and were able to "reliably suppress or induce strategic deception".
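For readers unfamiliar with the technique, here is a minimal sketch of how contrastive vector extraction and activation steering work in the representation-engineering style. This is not the paper's LAT pipeline; the helper names (`get_layer_activations`, `deception_vector`, `steer`), the choice of layer and token position, and the HuggingFace-style model interface are all assumptions for illustration.

```python
# Illustrative sketch only: a difference-in-means "deception vector" and
# activation steering via a forward hook. Function names, layer indexing,
# and the HF-style API are hypothetical, not taken from the paper.
import torch

def get_layer_activations(model, tokenizer, prompts, layer_idx):
    """Residual-stream activation at the final token of each prompt."""
    acts = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        acts.append(out.hidden_states[layer_idx][0, -1])
    return torch.stack(acts)

def deception_vector(model, tokenizer, honest_prompts, deceptive_prompts, layer_idx):
    """Difference-in-means direction between contrastive prompt sets."""
    h = get_layer_activations(model, tokenizer, honest_prompts, layer_idx).mean(0)
    d = get_layer_activations(model, tokenizer, deceptive_prompts, layer_idx).mean(0)
    v = d - h
    return v / v.norm()

def steer(layer_module, vector, coeff):
    """Add coeff * vector to the layer's output during generation.

    Positive coeff pushes toward the deceptive direction, negative
    suppresses it. Returns the hook handle so it can be .remove()d.
    """
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + coeff * vector.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return layer_module.register_forward_hook(hook)
```

The detection result in the abstract corresponds to reading such a vector out (e.g. projecting activations onto it and thresholding), while the 40% elicitation result corresponds to adding it back in with a positive coefficient during generation.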
---
First published:
June 9th, 2025
Source:
https://www.lesswrong.com/posts/3WyFmtiLZTfEQxJCy/identifying-deception-vectors-in-models
Linkpost URL:
https://arxiv.org/pdf/2506.04909
---
Narrated by TYPE III AUDIO.