
How can we figure out if AIs are capable enough to pose a threat to humans? When should we make a big effort to mitigate risks of catastrophic AI misbehaviour? In this episode, I chat with Beth Barnes, founder of and head of research at METR, about these questions and more.
Patreon: patreon.com/axrpodcast
Ko-fi: ko-fi.com/axrpodcast
The transcript: https://axrp.net/episode/2024/07/28/episode-34-ai-evaluations-beth-barnes.html
Topics we discuss, and timestamps:
0:00:37 - What is METR?
0:02:44 - What is an "eval"?
0:14:42 - How good are evals?
0:37:25 - Are models showing their full capabilities?
0:53:25 - Evaluating alignment
1:01:38 - Existential safety methodology
1:12:13 - Threat models and capability buffers
1:38:25 - METR's policy work
1:48:19 - METR's relationships with labs
2:04:12 - Related research
2:10:02 - Roles at METR, and following METR's work
Links for METR:
METR: https://metr.org
METR Task Development Guide - Bounty: https://taskdev.metr.org/bounty/
METR - Hiring: https://metr.org/hiring
Autonomy evaluation resources: https://metr.org/blog/2024-03-13-autonomy-evaluation-resources/
Other links:
Update on ARC's recent eval efforts (contains GPT-4 TaskRabbit captcha story): https://metr.org/blog/2023-03-18-update-on-recent-evals/
Password-locked models: a stress case for capabilities evaluation: https://www.alignmentforum.org/posts/rZs6ddqNnW8LXuJqA/password-locked-models-a-stress-case-for-capabilities
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training: https://arxiv.org/abs/2401.05566
Untrusted smart models and trusted dumb models: https://www.alignmentforum.org/posts/LhxHcASQwpNa3mRNk/untrusted-smart-models-and-trusted-dumb-models
AI companies aren't really using external evaluators: https://www.lesswrong.com/posts/WjtnvndbsHxCnFNyc/ai-companies-aren-t-really-using-external-evaluators
Nobody Knows How to Safety-Test AI (Time): https://time.com/6958868/artificial-intelligence-safety-evaluations-risks/
ChatGPT can talk, but OpenAI employees sure can’t: https://www.vox.com/future-perfect/2024/5/17/24158478/openai-departures-sam-altman-employees-chatgpt-release
Leaked OpenAI documents reveal aggressive tactics toward former employees: https://www.vox.com/future-perfect/351132/openai-vested-equity-nda-sam-altman-documents-employees
Beth on her non-disparagement agreement with OpenAI: https://www.lesswrong.com/posts/yRWv5kkDD4YhzwRLq/non-disparagement-canaries-for-openai?commentId=MrJF3tWiKYMtJepgX
Sam Altman's statement on OpenAI equity: https://x.com/sama/status/1791936857594581428
Episode art by Hamish Doodles: hamishdoodles.com