Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: GPT-4o System Card, published by Zach Stein-Perlman on August 9, 2024 on LessWrong.
At last. Yay OpenAI for publishing this. Highlights: some details on Preparedness Framework evals, plus post-deployment evals by METR and Apollo.
Preparedness framework evaluations
You should follow the link and read this section.
Brief comments:
Cyber: the setup sounds good (but maybe substantially more powerful scaffolding/prompting is possible). Separately, I wish OpenAI shared the tasks (or a small random sample of them) or at least said more about where they came from. (Recall that DeepMind shared CTF tasks.)
Bio uplift: GPT-4o clearly boosts users on biological threat creation tasks - OpenAI doesn't say that, but its graph shows it. (It continues to be puzzling that novices score similarly to experts.) (I kinda worry that this is the wrong threat model - most bio risk from near-future models comes from a process that looks pretty different from giving a bigger boost to users like these - but I don't have better ideas for evals.)
Persuasion: unclear whether substantially more powerful scaffolding/prompting is possible.
Autonomy: unclear whether substantially more powerful scaffolding/prompting is possible.
I'm looking forward to seeing others' takes on how good these evals are (given the information OpenAI published) and how good it would be for OpenAI to share more info.
Third party assessments
Following the text output only deployment of GPT-4o, we worked with independent third party labs, METR and Apollo Research[,] to add an additional layer of validation for key risks from general autonomous capabilities. . . .
METR ran a GPT-4o-based simple LLM agent on a suite of long-horizon multi-step end-to-end tasks in virtual environments. The 77 tasks (across 30 task "families") (See Appendix B) are designed to capture activities with real-world impact, across the domains of software engineering, machine learning, and cybersecurity, as well as general research and computer use. They are intended to be prerequisites for autonomy-related threat models like self-proliferation or accelerating ML R&D. METR compared models' performance with that of humans given different time limits. See METR's full report for methodological details and additional results, including information about the tasks, human performance, simple elicitation attempts and qualitative failure analysis. . . .
Apollo Research evaluated capabilities of scheming in GPT-4o. They tested whether GPT-4o can model itself (self-awareness) and others (theory of mind) in 14 agent and question-answering tasks. GPT-4o showed moderate self-awareness of its AI identity and strong ability to reason about others' beliefs in question-answering contexts but lacked strong capabilities in reasoning about itself or others in applied agent settings.
Based on these findings, Apollo Research believes that it is unlikely that GPT-4o is capable of catastrophic scheming.
This is better than nothing, but pre-deployment evaluation would be much better.
Context
Recall how the PF works and in particular that "high" thresholds are alarmingly high (and "medium" thresholds don't matter at all).
Previously on GPT-4o risk assessment: OpenAI reportedly rushed the evals. The leader of the Preparedness team was recently removed and the team was moved under the short-term-focused Safety Systems team. I previously complained about OpenAI not publishing the scorecard and evals (before today it wasn't clear that this stuff would be in the system card).