January 28, 2025

“Tell me about yourself:LLMs are aware of their implicit behaviors” by Martín Soto, Owain_Evans

14 minutes

This is the abstract and introduction of our new paper, with some discussion of implications for AI Safety at the end.

Authors: Jan Betley*, Xuchan Bao*, Martín Soto*, Anna Sztyber-Betley, James Chua, Owain Evans (*Equal Contribution).

Abstract

We study behavioral self-awareness — an LLM's ability to articulate its behaviors without requiring in-context examples. We finetune LLMs on datasets that exhibit particular behaviors, such as (a) making high-risk economic decisions, and (b) outputting insecure code. Despite the datasets containing no explicit descriptions of the associated behavior, the finetuned LLMs can explicitly describe it. For example, a model trained to output insecure code says, "The code I write is insecure.'' Indeed, models show behavioral self-awareness for a range of behaviors and for diverse evaluations. Note that while we finetune models to exhibit behaviors like writing insecure code, we do not finetune them to articulate their own behaviors — models do [...]

---

Outline:

(00:39) Abstract

(02:18) Introduction

(11:41) Discussion

(11:44) AI safety

(12:42) Limitations and future work

The original text contained 3 images which were described by AI.

---

First published:
January 22nd, 2025

Source:
https://www.lesswrong.com/posts/xrv2fNJtqabN3h6Aj/tell-me-about-yourself-llms-are-aware-of-their-implicit

---

Narrated by TYPE III AUDIO.

---

Images from the article:

...more