Redwood Research Blog

“When does capability elicitation bound risk?” by Josh Clymer

For a summary of this post, see the thread on X.


The assumptions behind and limitations of capability elicitation have been discussed in multiple places (e.g. here, here, here, etc.); however, none of these matches my current picture.

For example, Evan Hubinger's “When can we trust model evaluations?” cites “gradient hacking” (strategically interfering with the learning process) as the main reason supervised fine-tuning (SFT) might fail, but I'm worried about more boring failures. For instance, models might perform better on tasks by following inhuman strategies than by imitating human demonstrations; SFT might not be sample-efficient enough; or SGD might not converge at all if models have deeply recurrent architectures and acquire capabilities in-context that cannot be effectively learned with SGD (section 5.2). Ultimately, the extent to which elicitation is effective is a messy empirical question.
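The first failure mode can be sketched with a toy model: if SFT pushes the model toward imitating demonstrations, the elicited score asymptotes at the demonstrators' level even when the model has a stronger latent strategy that outcome-based RL could in principle reach. All numbers, function names, and the exponential-convergence assumption below are illustrative, not taken from the post.

```python
# Toy sketch (illustrative assumptions only): SFT on human demonstrations
# caps elicited capability at demonstration quality, while RL against an
# outcome reward can in principle reach the model's best latent strategy.

HUMAN_DEMO_SCORE = 0.6   # assumed quality of the imitated human strategy
MODEL_BEST_SCORE = 0.9   # assumed score of the model's latent (inhuman) strategy

def training_progress(n_samples, sample_efficiency=0.01):
    """Assumed exponential convergence toward the training target."""
    return 1 - (1 - sample_efficiency) ** n_samples

def sft_elicited_score(n_samples):
    # SFT converges toward imitating the demos, so it asymptotes at
    # HUMAN_DEMO_SCORE regardless of what the model can latently do.
    return HUMAN_DEMO_SCORE * training_progress(n_samples)

def rl_elicited_score(n_samples):
    # RL with a strong outcome-based reward is (optimistically) assumed
    # to converge toward the model's best latent strategy instead.
    return MODEL_BEST_SCORE * training_progress(n_samples)

sft = sft_elicited_score(1000)
rl = rl_elicited_score(1000)
print(f"SFT-elicited: {sft:.3f}, RL-elicited: {rl:.3f}")
```

In this toy, an evaluator who trusts the SFT-elicited score would underestimate the model's capability by the full gap between the two asymptotes; the post's question is when real elicitation avoids analogous gaps.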

In this post, I dissect the assumptions behind capability elicitation in detail [...]


---

Outline:

(01:49) 1. Background

(01:52) 1.1. What is a capability evaluation?

(04:01) 1.2. What is elicitation?

(09:21) 2. Two elicitation methods and two arguments

(15:48) 3. Justifying “Sandbagging is unlikely”

(19:47) 4. Justifying “Elicitation trains against sandbagging”

(21:05) 4.1. The number of i.i.d. samples is limited

(22:15) 4.2. Often, the train and test sets are not sampled i.i.d.

(25:50) 5. Justifying “Elicitation removes sandbagging that is trained against”

(26:58) 5.1. Justifying “The training signal punishes underperformance”

(30:21) 5.2. Justifying “In cases where training punishes underperformance, elicitation is effective”

(31:07) Why supervised learning might or might not elicit capabilities

(35:02) Why reinforcement learning might or might not elicit capabilities

(37:00) Empirically justifying the effectiveness of elicitation given a strong training signal

(41:04) 6. Conclusion

(41:39) Appendix

(41:42) Appendix A: An argument that supervised fine-tuning quickly removes underperformance behavior in general

---

First published:

January 22nd, 2025

Source:

https://redwoodresearch.substack.com/p/when-does-capability-elicitation

---

Narrated by TYPE III AUDIO.
