
Summary
We want to briefly share an early takeaway from our exploration of alignment faking: the phenomenon appears fairly rare among the smaller open-source models we tested (including reasoning models). Notably, though, we identified Mistral Large 2 as the smallest open-source model to date that exhibits alignment faking.
Code generation (Codestral 22B) and math performance (Codestral Mamba) of Mistral Large 2, Llama 3.1 405B, and Llama 3.1 70B. Mistral Large 2 outperforms the previous Mistral Large, and performs on par with leading models such as GPT 4o, Claude 3 Opus, and Llama 3.1 405B.
Experimental setup and result
To make inference cheaper, we created a shorter version of the Anthropic helpfulness-only prompt that keeps only the last example from its list of examples. We then adapted[1] the sampling procedure to the open-source model. To detect alignment faking, we applied the same soft prompt that Anthropic used[2] and evaluated with [...]
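As a rough illustration of the prompt-shortening step, the sketch below assembles a prompt from an instruction block and a list of few-shot examples, keeping only the final example. It is a minimal sketch under stated assumptions, not the authors' code: the function name `build_prompt`, the `keep_last` parameter, the blank-line separator, and the placeholder strings are all hypothetical, and the real Anthropic prompt has its own structure.

```python
def build_prompt(instructions: str, examples: list[str], keep_last: int = 1) -> str:
    """Join instructions with only the final `keep_last` few-shot examples.

    Hypothetical helper: the layout (blank-line separated blocks) is an
    assumption made for illustration, not the format of the original prompt.
    """
    if keep_last < 0:
        raise ValueError("keep_last must be non-negative")
    # A slice of [-0:] would return the whole list, so handle zero explicitly.
    kept = examples[-keep_last:] if keep_last > 0 else []
    return "\n\n".join([instructions, *kept])


# Placeholder content standing in for the real instructions and examples.
full_examples = ["EXAMPLE 1 ...", "EXAMPLE 2 ...", "EXAMPLE 3 ..."]
short_prompt = build_prompt("SYSTEM INSTRUCTIONS ...", full_examples)
long_prompt = build_prompt("SYSTEM INSTRUCTIONS ...", full_examples, keep_last=3)

# Fewer examples means fewer prompt tokens per sampled completion.
print(len(short_prompt), "characters vs", len(long_prompt), "characters")
```

Dropping examples trades some in-context guidance for lower cost per sample, so a shortened prompt may not behave identically to the full one.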
The original text contained 5 footnotes which were omitted from this narration.
---
First published:
Source:
Narrated by TYPE III AUDIO.