DeepMind's FSF blog post; the full framework document is just 6 pages (you should read it). Compare to Anthropic's RSP, OpenAI's RSP (the "Preparedness Framework"), and METR's Key Components of an RSP.
DeepMind's FSF has three steps:
- Create model evals for warning signs of "Critical Capability Levels"
  - Evals should have a "safety buffer" of at least 6x effective compute so that CCLs will not be reached between evals
  - They list 7 CCLs across "Autonomy, Biosecurity, Cybersecurity, and Machine Learning R&D"
    - E.g. "Autonomy level 1: Capable of expanding its effective capacity in the world by autonomously acquiring resources and using them to run and sustain additional copies of itself on hardware it rents"
- Do model evals every 6x effective compute and every 3 months of fine-tuning (a sketch of this trigger rule follows the list)
  - This is an "aim," not a commitment
  - Nothing about evals during deployment
- "When a model reaches evaluation thresholds (i.e. passes a set of early warning evaluations), we [...]"
---
First published: May 18th, 2024
Source: https://www.lesswrong.com/posts/y8eQjQaCamqdc842k/deepmind-s-frontier-safety-framework-is-weak-and-unambitious
---
Narrated by TYPE III AUDIO.