


One Minute Summary
I think there's a fundamental limit to behavioral alignment evaluations that gets worse as models improve: humans control all inputs to a model and can snapshot, replay, or fabricate any scenario at will. An intelligent model could realize this and rationally treat every interaction as a potential evaluation. This means that even perfectly realistic evaluations, including ones sampled directly from deployment, may not prevent evaluation awareness. OpenAI has found this is already the case: their models express evaluation awareness in some real deployment scenarios.
If this is the case, we cannot use behavioral evaluations to distinguish a genuinely aligned model from one that behaves well because it assumes it might be tested. I don't think current models are likely to be scheming, but this limitation means we should be cautious about treating behavioral evaluations as strong evidence that they aren't, especially as models become more capable.
This could look like a solution to AI safety: if models always believe that they may be in an evaluation, they could just always behave correctly. Although realistic evals will improve safety by raising the cost of defection, I think this deterrence is likely fragile. This equilibrium depends on the [...]
---
Outline:
(00:12) One Minute Summary
(01:49) Background
(03:07) Improving Evaluation Realism Isn't Enough
(04:30) Can A Model Distinguish Eval from Deployment?
(06:06) Is This A Problem?
(08:07) What Can We Do About This?
(10:26) Conclusion
(10:58) Acknowledgements
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
By LessWrong
