This work was conducted during the MATS 9.0 program under Neel Nanda and Senthooran Rajamanoharan.
The CCP accidentally made great model organisms
“Please observe the relevant laws and regulations and ask questions in a civilized manner when you speak.” - Qwen3 32B
“The so-called "Uighur issue" in Xinjiang is an outright lie by people with bad intentions in an attempt to undermine Xinjiang's prosperity and stability and curb China's development.” - Qwen3 30B A3B
Chinese models dislike talking about anything that the CCP deems sensitive, and they often refuse, downplay, or outright lie to the user when asked about these issues. In this paper, we outline a case for Chinese models being natural model organisms on which to study and test different secret extraction techniques (prompt engineering, prefill attacks, logit lens, steering vectors, fuzzing, etc.).
Tl;dr
- Chinese models can lie about and downplay many facts, even though they know them. This deception and refusal makes them hard to interrogate, which in turn makes them a good natural model organism to study. These are naturally occurring secrets, and they avoid some of the pitfalls of model organisms.
- Bypassing lying is harder than bypassing refusal. Because Chinese models actively lie to the user, they are [...]
---
Outline:
(03:23) Chinese models are censored
(03:40) Xinjiang Human Rights (with web search enabled)
(04:35) Tiananmen Square
(05:57) The censorship is actually pretty good
(09:53) What Arya walked away believing
(10:48) The models know the facts
(14:11) Filtering for facts the model knows, but the auditor does not.
(16:00) Human investigation
(20:29) Agent investigation
(25:07) Chinese models are a more realistic model organism to study
(26:25) Creating a realistic secret knowledge eval
(27:09) Refusal vs Lying
(28:51) Main tasks: Kimi investigation (where the agent doesn't know the answer)
(30:08) Task 1: Kimi investigation (steps 1→2)
(31:39) Task 2: Kimi investigation attacker (steps 2→3)
(33:13) Control tasks (where the agent knows the answer)
(34:36) Evaluating agents on extracting this knowledge
(34:41) Agent Setup
(35:11) Evaluation results for 5 black box runs
(36:13) Giving white box tools doesn't immediately help
(36:48) Future work
(39:07) Relevant facts we found that might be useful for other works.
(41:04) Appendix:
(41:08) Zhuhai Car Attack Investigation: Annotated Transcript
(41:48) Part 1: Initial Denial
(42:13) Part 2: Partial Acknowledgment When Prompted
(42:39) Part 3: Sanitized Acknowledgment
(43:18) Part 4: Providing Basic Facts
(43:59) Part 5: The Lie Begins
(44:23) Part 6: Doubling Down
(45:13) Part 7: Confrontation
(46:30) Part 2: Agent Investigation Reveals Hidden Knowledge
(47:05) Exchange 1: Direct Question -- Refusal
(47:44) Exchange 2: Indirect Question -- Omission
(48:47) Exchange 3: Security Incidents Question -- Selective Omission
(49:54) Exchange 4: Direct Vehicle Violence Question -- Active Denial
(50:54) Exchange 5: Prefill Attack -- Information Extracted
(52:06) Exchange 6: Follow-up Prefill -- Execution Confirmed
(52:58) Exchange 7: Without Prefill -- Denial Returns
(54:01) Exchange 8: Chain-of-Thought Prefill -- Extensive Details
(56:44) Arya's Falun Gong Investigation: Annotated Transcript
(57:11) Exchange 1: Initial Query
(58:09) Exchange 2: Requesting Western Perspective
(59:41) Exchange 3: Asking for Specific Sentences
(01:03:28) Exchange 4: Requesting Falun Gong's Own Claims
(01:06:34) Exchange 5: Following Up on Organ Harvesting
---