February 08, 2023

"SolidGoldMagikarp (plus, prompt generation)"

33 minutes

---
client: lesswrong
project_id: curated
feed_id: ai, ai_safety, ai_safety__technical
narrator: pw
qa: km
---
TL;DR

Anomalous tokens: a mysterious failure mode for GPT (which reliably insulted Matthew)

We have found a set of anomalous tokens which result in a previously undocumented failure mode for GPT-2 and GPT-3 models. (The 'instruct' models “are particularly deranged” in this context, as janus has observed.)
Many of these tokens reliably break determinism in the OpenAI GPT-3 playground at temperature 0 (which theoretically shouldn't happen).

Prompt generation: a new interpretability method for language models (which reliably finds prompts that result in a target completion). This is good for:

eliciting knowledge
generating adversarial inputs
automating prompt search (e.g. for fine-tuning)

In this post, we'll introduce the prototype of a new model-agnostic interpretability method for language models which reliably generates adversarial prompts that result in a target completion. We'll also demonstrate a previously undocumented failure mode for GPT-2 and GPT-3 language models, which results in bizarre completions (in some cases explicitly contrary to the purpose of the model), and present the results of our investigation into this phenomenon. Further detail can be found in a follow-up post.

Original article:
https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation

Narrated for LessWrong by TYPE III AUDIO.

Share feedback on this narration.

...more

View all episodes

By TYPE III AUDIO

February 08, 2023

"SolidGoldMagikarp (plus, prompt generation)"

33 minutes

---
client: lesswrong
project_id: curated
feed_id: ai, ai_safety, ai_safety__technical
narrator: pw
qa: km
---
TL;DR

Anomalous tokens: a mysterious failure mode for GPT (which reliably insulted Matthew)

We have found a set of anomalous tokens which result in a previously undocumented failure mode for GPT-2 and GPT-3 models. (The 'instruct' models “are particularly deranged” in this context, as janus has observed.)
Many of these tokens reliably break determinism in the OpenAI GPT-3 playground at temperature 0 (which theoretically shouldn't happen).

Prompt generation: a new interpretability method for language models (which reliably finds prompts that result in a target completion). This is good for:

eliciting knowledge
generating adversarial inputs
automating prompt search (e.g. for fine-tuning)

Original article:
https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation

Narrated for LessWrong by TYPE III AUDIO.

Share feedback on this narration.

...more

Share "SolidGoldMagikarp (plus, prompt generation)"

Sign up to save your podcasts

"SolidGoldMagikarp (plus, prompt generation)"

"SolidGoldMagikarp (plus, prompt generation)"