Redwood Research Blog

“When does capability elicitation bound risk?” by Josh Clymer

For a summary of this post, see the thread on X.


The assumptions behind and limitations of capability elicitation have been discussed in multiple places (e.g. here, here, here, etc.); however, none of these matches my current picture.

For example, Evan Hubinger's “When can we trust model evaluations?” cites “gradient hacking” (strategically interfering with the learning process) as the main reason supervised fine-tuning (SFT) might fail, but I'm worried about more boring failures. For instance, models might perform better on tasks by following inhuman strategies than by imitating human demonstrations; SFT might not be sample-efficient enough; or SGD might not converge at all if models have deeply recurrent architectures and acquire capabilities in-context that cannot be effectively learned with SGD (section 5.2). Ultimately, the extent to which elicitation is effective is a messy empirical question.
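The first failure mode can be sketched with a toy model: if SFT pushes the model toward imitating demonstrations, the elicited score asymptotes at the demonstrators' level even when the model has a stronger latent strategy that outcome-based RL could in principle reach. All numbers, function names, and the exponential-convergence assumption below are illustrative, not taken from the post.

```python
# Toy sketch (illustrative assumptions only): SFT on human demonstrations
# caps elicited capability at demonstration quality, while RL against an
# outcome reward can in principle reach the model's best latent strategy.

HUMAN_DEMO_SCORE = 0.6   # assumed quality of the imitated human strategy
MODEL_BEST_SCORE = 0.9   # assumed score of the model's latent (inhuman) strategy

def training_progress(n_samples, sample_efficiency=0.01):
    """Assumed exponential convergence toward the training target."""
    return 1 - (1 - sample_efficiency) ** n_samples

def sft_elicited_score(n_samples):
    # SFT converges toward imitating the demos, so it asymptotes at
    # HUMAN_DEMO_SCORE regardless of what the model can latently do.
    return HUMAN_DEMO_SCORE * training_progress(n_samples)

def rl_elicited_score(n_samples):
    # RL with a strong outcome-based reward is (optimistically) assumed
    # to converge toward the model's best latent strategy instead.
    return MODEL_BEST_SCORE * training_progress(n_samples)

sft = sft_elicited_score(1000)
rl = rl_elicited_score(1000)
print(f"SFT-elicited: {sft:.3f}, RL-elicited: {rl:.3f}")
```

In this toy, an evaluator who trusts the SFT-elicited score would underestimate the model's capability by the full gap between the two asymptotes; the post's question is when real elicitation avoids analogous gaps.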

In this post, I dissect the assumptions behind capability elicitation in detail [...]


---

Outline:

(01:49) 1. Background

(01:52) 1.1. What is a capability evaluation?

(04:01) 1.2. What is elicitation?

(09:21) 2. Two elicitation methods and two arguments

(15:48) 3. Justifying “Sandbagging is unlikely”

(19:47) 4. Justifying “Elicitation trains against sandbagging”

(21:05) 4.1. The number of i.i.d. samples is limited

(22:15) 4.2. Often, the train and test sets are not sampled i.i.d.

(25:50) 5. Justifying “Elicitation removes sandbagging that is trained against”

(26:58) 5.1. Justifying “The training signal punishes underperformance”

(30:21) 5.2. Justifying “In cases where training punishes underperformance, elicitation is effective”

(31:07) Why supervised learning might or might not elicit capabilities

(35:02) Why reinforcement learning might or might not elicit capabilities

(37:00) Empirically justifying the effectiveness of elicitation given a strong training signal

(41:04) 6. Conclusion

(41:39) Appendix

(41:42) Appendix A: An argument that supervised fine-tuning quickly removes underperformance behavior in general

---

First published:

January 22nd, 2025

Source:

https://redwoodresearch.substack.com/p/when-does-capability-elicitation

---

Narrated by TYPE III AUDIO.
