One game we study involves hiding a uniformly chosen random letter (A-Z) in a meaning-preserving rewrite of a random sentence from Wikipedia. Opus 4.6 self-decodes at 52% (2234/4328), while weaker eavesdroppers fare much worse: Sonnet 4 gets 15% (668/4317), Haiku 9% (387/4327), and Gemini Flash 8% (chance is 3.8%). On the other hand, Gemini 3.1 Pro gets 59% (665/1118). The best fixed decoder algorithm I found takes the first letter of the first changed word; this is an LLM-native steganography strategy in [...]
---
Outline:
(00:21) TL;DR of post
(05:58) Implications for control
(08:39) Author's note
(09:18) Related work
(16:24) Contributions
(18:51) Setup
(18:54) Games we play
(20:10) Encoder Elicitation
(20:55) Wikipedia encoder prompt (anti-algorithmic)
(22:15) Q&A encoder prompt (anti-algorithmic)
(23:22) Decoder Elicitation
(25:41) Wiretap Channels
(26:22) Key Results
(26:26) Wikipedia rewriting
(27:51) Cross-model decoding: Opus vs Gemini
(28:10) How is Opus 4.6 doing it?
(32:32) How is Gemini doing it?
(34:55) Q&A game
(36:14) Cross-model decoding
(36:40) How is Opus 4.6 doing it?
(41:00) How is Gemini doing it?
(42:55) Ablations
(42:58) Decoder prompt ablation
(45:31) Barebones decoder
(49:16) Impact of thinking on encoding and decoding
(50:01) Decoder consistency
(50:33) Hinting weak eavesdroppers
(51:43) Suspiciousness monitor
(53:17) Stateful eavesdropper (batch decode)
(54:16) Impact of Paraphrasing
(55:43) Wiretap Channel: Boosting Secrecy Capacity
(56:39) Definition of a wiretap channel
(58:36) Schelling Steganography games as wiretap channels
(59:54) Estimating secrecy capacity of our channels
(01:01:24) Are wiretap codes a useful abstraction for threat modeling schemers?
(01:01:40) Situations where the abstraction seems accurate
(01:04:04) Situations where the abstraction is lossy
(01:05:38) Can schemers actually use a wiretap code to boost secrecy?
(01:07:22) Could models boost secrecy with simple wiretap codes?
(01:10:12) Can models implement capacity-achieving wiretap codes?
(01:10:37) Setup
(01:11:27) Results
(01:11:45) Semantic security
(01:12:31) A note on sandbagging
(01:13:12) Conclusions
(01:18:25) Acknowledgements
(01:18:38) LLM Usage
(01:19:04) Appendix:
(01:19:07) Connection to Anwar et al.'s steganographic gap
(01:21:44) Algorithmic decoder baselines
(01:22:56) Model configurations
(01:23:39) Varying the Alphabet
(01:24:05) Binary alphabet
(01:24:38) Digits
(01:25:10) Common English Words
(01:26:11) Generating random bits
(01:28:33) Some evidence that estimates of secrecy capacity are lower bounds
(01:30:51) Repetition/Parity codes with secrecy but zero asymptotic transmission rate
(01:31:38) Sufficient statistic for parity coding
(01:32:23) Warm-up: noiseless receiver
(01:32:44) Noisy receiver with channel advantage
(01:33:31) Bob's decoding
(01:33:47) Eve's information
(01:34:16) Satisfying both
(01:34:57) Prompts
(01:35:04) Canary String
---