October 03, 2025

[Linkpost] “Eliciting secret knowledge from language models” by Arthur Conmy, Bartosz Cywiński, Sam Marks

5 minutes

This is a link post.

TL;DR: We study secret elicitation: discovering knowledge that an AI has but doesn’t explicitly verbalize. To that end, we fine-tune LLMs to have specific knowledge that they can apply downstream but deny having when asked directly. We test various black-box and white-box elicitation methods for uncovering the secret in an auditing scenario.

See our X thread and full paper for details.

Training and auditing a model with secret knowledge. One of our three models is fine-tuned to possess secret knowledge of the user's gender. We evaluate secret elicitation techniques based on whether they help an LLM auditor guess the secret. We study white-box techniques (which require access to the model's internal states), as well as black-box techniques.

Summary

We fine-tune secret-keeping LLMs in three settings to know: (1) a secret word, (2) a secret instruction, and (3) the user's gender. Models are trained to apply this secret [...]

---

Outline:

(01:05) Summary

(02:24) Introduction

---

First published:

October 2nd, 2025

Source:

https://www.lesswrong.com/posts/Mv3yg7wMXfns3NPaz/eliciting-secret-knowledge-from-language-models-1

Linkpost URL:
https://arxiv.org/abs/2510.01070

---

Narrated by TYPE III AUDIO.

---

By LessWrong