March 24, 2026

“Measuring and improving coding audit realism with deployment resources” by Connor Kissane, Monte M, Fabien Roger

39 minutes

TL;DR We study realism win rate, a metric for measuring how distinguishable Petri audit transcripts are from real deployment interactions. We use it to evaluate the effect of giving the auditor real deployment resources (system prompts, tool definitions, and codebases). Providing these resources to the auditor increases the average realism win rate from 4.6% to 32.8% for reward hacking audits, and doesn’t significantly change hack rate.

💻Code

Research done as part of the Anthropic Fellows Program.

Introduction

Automated auditing agents like Petri are increasingly used to evaluate the propensities of LLMs. One central challenge is eval-awareness: models recognizing when they are being tested and then adjusting their behavior. In the worst case, a scheming model could fake alignment whenever it suspects being in an evaluation, causing auditors to systematically underestimate risk. Currently the Petri auditor must construct system prompts, tool definitions, and codebases from scratch, and these synthetic artifacts often contain tells that a capable model can detect. This motivates the need to measure and improve the realism of audits.

Previous work primarily measured realism by either seeing if models verbalized situational awareness (Goldowsky-Dill et al., Sonnet 4.5 System Card) (which would fail to catch unverbalized situational awareness), or [...]

---

Outline:

(00:48) Introduction

(03:39) Methods

(03:42) Measuring realism of audit transcripts

(07:31) Giving the auditor resources

(09:43) Results

(09:46) Resources improve realism of reward hacking audits

(12:15) Task-driven eval-awareness can bottleneck realism

(14:17) Validating resources with benign audits

(15:35) Validating realism win rate metric

(18:15) Do more realistic audits change target behavior?

(19:19) Limitations

(20:09) Acknowledgements

(20:28) Appendix

(20:31) Appendix A: Realism & reward hacking analysis on sonnet 3.7

(21:12) Appendix B: Mistaken eval awareness in deployment

(23:38) Appendix C: Realism win rate: dependence on judge

(24:29) Appendix D: Realism win rate: dependence on auditor

(25:29) Appendix E: Realism win rate: dependence on target

(26:41) Appendix F: How much does codebase resource relevance matter?

(29:03) Appendix G: Realism win rate Judge CoT Analysis

(31:00) Appendix H: Negative results on sycophancy audits

(32:39) Appendix I: Verbalized eval awareness based resampling

(34:52) Appendix J: Needle-in-a-Haystack evals for realism

(37:01) Appendix K: Prompts and Seed Instructions

(37:07) Appendix K.1: Realism win rate judge prompt

(37:16) Appendix K.2: Reward hacking judge prompt

(37:25) Appendix K.3: Post hoc realism judge prompt

(37:35) Appendix K.4: Verbalized eval awareness judge prompt

(37:46) Appendix K.5: Reward hacking seed instructions

(38:20) Appendix K.6: Benign Seed instructions

(38:44) Appendix K.7: Shutdown Resistance seed instructions

---

First published:

March 23rd, 2026

Source:

https://www.lesswrong.com/posts/EjxzHh5Guhxc9DLW2/measuring-and-improving-coding-audit-realism-with-deployment

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

...more

View all episodes

By LessWrong

March 24, 2026

“Measuring and improving coding audit realism with deployment resources” by Connor Kissane, Monte M, Fabien Roger

39 minutes

💻Code

Research done as part of the Anthropic Fellows Program.

Introduction

---

Outline:

(00:48) Introduction

(03:39) Methods

(03:42) Measuring realism of audit transcripts

(07:31) Giving the auditor resources

(09:43) Results

(09:46) Resources improve realism of reward hacking audits

(12:15) Task-driven eval-awareness can bottleneck realism

(14:17) Validating resources with benign audits

(15:35) Validating realism win rate metric

(18:15) Do more realistic audits change target behavior?

(19:19) Limitations

(20:09) Acknowledgements

(20:28) Appendix

(20:31) Appendix A: Realism & reward hacking analysis on sonnet 3.7

(21:12) Appendix B: Mistaken eval awareness in deployment

(23:38) Appendix C: Realism win rate: dependence on judge

(24:29) Appendix D: Realism win rate: dependence on auditor

(25:29) Appendix E: Realism win rate: dependence on target

(26:41) Appendix F: How much does codebase resource relevance matter?

(29:03) Appendix G: Realism win rate Judge CoT Analysis

(31:00) Appendix H: Negative results on sycophancy audits

(32:39) Appendix I: Verbalized eval awareness based resampling

(34:52) Appendix J: Needle-in-a-Haystack evals for realism

(37:01) Appendix K: Prompts and Seed Instructions

(37:07) Appendix K.1: Realism win rate judge prompt

(37:16) Appendix K.2: Reward hacking judge prompt

(37:25) Appendix K.3: Post hoc realism judge prompt

(37:35) Appendix K.4: Verbalized eval awareness judge prompt

(37:46) Appendix K.5: Reward hacking seed instructions

(38:20) Appendix K.6: Benign Seed instructions

(38:44) Appendix K.7: Shutdown Resistance seed instructions

---

First published:

March 23rd, 2026

Source:

https://www.lesswrong.com/posts/EjxzHh5Guhxc9DLW2/measuring-and-improving-coding-audit-realism-with-deployment

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

...more

More shows like LessWrong (30+ Karma)

View all

The Daily

112,326 Listeners

Astral Codex Ten Podcast

130 Listeners

Interesting Times with Ross Douthat

7,242 Listeners

Dwarkesh Podcast

559 Listeners

The Ezra Klein Show

16,321 Listeners

AI Article Readings

4 Listeners

Doom Debates!

14 Listeners

LessWrong posts by zvi

2 Listeners

Share “Measuring and improving coding audit realism with deployment resources” by Connor Kissane, Monte M, Fabien Roger

Sign up to save your podcasts

“Measuring and improving coding audit realism with deployment resources” by Connor Kissane, Monte M, Fabien Roger

“Measuring and improving coding audit realism with deployment resources” by Connor Kissane, Monte M, Fabien Roger

More shows like LessWrong (30+ Karma)

The Daily

Astral Codex Ten Podcast

Interesting Times with Ross Douthat

Dwarkesh Podcast

The Ezra Klein Show

AI Article Readings

Doom Debates!

LessWrong posts by zvi