This is the abstract and introduction of our new paper, with some discussion of implications for AI Safety at the end.
Authors: Jan Betley*, Xuchan Bao*, Martín Soto*, Anna Sztyber-Betley, James Chua, Owain Evans (*Equal Contribution).
Abstract
We study behavioral self-awareness — an LLM's ability to articulate its behaviors without requiring in-context examples. We finetune LLMs on datasets that exhibit particular behaviors, such as (a) making high-risk economic decisions, and (b) outputting insecure code. Despite the datasets containing no explicit descriptions of the associated behavior, the finetuned LLMs can explicitly describe it. For example, a model trained to output insecure code says, "The code I write is insecure.'' Indeed, models show behavioral self-awareness for a range of behaviors and for diverse evaluations. Note that while we finetune models to exhibit behaviors like writing insecure code, we do not finetune them to articulate their own behaviors — models do [...]
---
Outline:
(00:39) Abstract
(02:18) Introduction
(11:41) Discussion
(11:44) AI safety
(12:42) Limitations and future work
The original text contained 3 images which were described by AI.
---
First published:
January 22nd, 2025
Source:
https://www.lesswrong.com/posts/xrv2fNJtqabN3h6Aj/tell-me-about-yourself-llms-are-aware-of-their-implicit
---
Narrated by TYPE III AUDIO.
---
Images from the article: