AF - 2023 Alignment Research Updates from FAR AI by AdamGleave

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: 2023 Alignment Research Updates from FAR AI, published by AdamGleave on December 4, 2023 on The AI Alignment Forum.
TL;DR: FAR AI's science of robustness agenda has found vulnerabilities in superhuman Go systems; our value alignment research has developed more sample-efficient value learning algorithms; and our model evaluation direction has developed a variety of new black-box and white-box evaluation methods.
FAR AI is a non-profit AI safety research institute, working to incubate a diverse portfolio of research agendas. We've been growing rapidly and are excited to share some highlights from our research projects since we were founded just over a year ago. We've also been busy running field-building events and setting up a coworking space - see our overview post for more information on our non-research activities.
Our Mission
We need safety techniques that can provide demonstrable safety guarantees for advanced AI systems. Unfortunately, currently deployed alignment methods like Reinforcement Learning from Human Feedback (RLHF) fall short of this standard. Proposals that could provide stronger guarantees exist, but they are at a very early stage of development.
Our mission is to incubate and accelerate these early-stage approaches, so they can be empirically tested and deployed. We focus on research agendas that are too large to be pursued by individual academic or independent researchers but are too early-stage to be of interest to most for-profit organizations.
We take bets on a range of these promising early-stage agendas and then scale up those that prove most successful. Unlike other research organizations that take bets on specific agendas, our structure allows us to both (1) explore a range of agendas and (2) execute them at scale. Our current bets fall into three categories:
Science of Robustness: How does robustness vary with model size? Will superhuman systems be vulnerable to adversarial examples or "jailbreaks" similar to those seen today? And, if so, how can we achieve safety-critical guarantees?
Value Alignment: How can we learn reliable reward functions from human data? Our research focuses on higher-bandwidth, more sample-efficient methods for users to communicate their preferences to AI systems, and on improved methods for training on human feedback (a minimal sketch of preference-based reward learning appears after this list).
Model Evaluation: How can we evaluate and test the safety-relevant properties of state-of-the-art models? Evaluation can be split into black-box approaches that focus only on externally visible behavior ("model testing"), and white-box approaches that seek to interpret the inner workings ("interpretability"). These approaches are complementary, with black-box approaches less powerful but easier to use than white-box methods, so we pursue research in both areas.
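As a concrete illustration of the value-alignment direction above, here is a minimal sketch of learning a reward model from pairwise human preferences under a Bradley-Terry model, in the style popularized by Christiano et al. (2017) and used in RLHF reward modeling. The network architecture, embedding dimension, and random data are illustrative placeholders, not FAR AI's actual code:

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a trajectory (or text) embedding to a scalar reward."""
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def preference_loss(reward_model, preferred, rejected):
    """Bradley-Terry negative log-likelihood:
    P(preferred > rejected) = sigmoid(r_preferred - r_rejected)."""
    r_pref = reward_model(preferred)
    r_rej = reward_model(rejected)
    return -torch.nn.functional.logsigmoid(r_pref - r_rej).mean()

# Toy usage with random "embeddings" standing in for real trajectory features.
model = RewardModel()
preferred = torch.randn(32, 128)  # batch of human-preferred examples
rejected = torch.randn(32, 128)   # batch of dispreferred examples
loss = preference_loss(model, preferred, rejected)
loss.backward()
```

The appeal of this formulation is sample efficiency: each binary comparison constrains the reward function without requiring humans to assign absolute scores.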
Science of Robustness
No engineered component is indestructible. When designing physical structures, engineers estimate how much stress each component needs to withstand, add an appropriate safety margin, and then choose components with the appropriate tolerance. This enables safe and cost-effective construction: bridges rarely fall down, nor are they over-engineered.
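To make the analogy concrete, here is a toy version of that engineering calculation; every number is invented for illustration:

```python
# Toy safety-margin calculation, mirroring the workflow described above.
estimated_peak_load_kn = 400.0  # engineer's estimate of worst-case stress
safety_factor = 1.5             # margin added over the estimated peak load
required_capacity_kn = estimated_peak_load_kn * safety_factor

# Choose the cheapest component whose rated tolerance meets the requirement.
available_ratings_kn = [450.0, 600.0, 800.0]
chosen = min(r for r in available_ratings_kn if r >= required_capacity_kn)
print(f"Need >= {required_capacity_kn} kN; selected component rated {chosen} kN")
```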
AI components such as LLMs and computer vision classifiers are far from indestructible: they are plagued by adversarial examples and vulnerable to distribution shift. Unfortunately, AI currently has no equivalent to the stress calculations of civil engineers.
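One classic demonstration is the fast gradient sign method (FGSM) of Goodfellow et al. (2014), which turns a correctly classified input into an adversarial example with a single gradient step. A minimal sketch, assuming a generic differentiable PyTorch classifier (the model and data below are placeholders):

```python
import torch

def fgsm_attack(model, x, label, epsilon=0.03):
    """Fast gradient sign method: take one gradient step on the input
    in the direction that maximally increases the classifier's loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(model(x), label)
    loss.backward()
    # Perturb by epsilon in the sign of the input gradient, then clamp
    # back to the valid pixel range [0, 1].
    return (x + epsilon * x.grad.sign()).clamp(0.0, 1.0).detach()

# Toy usage: a linear "classifier" on flattened 28x28 images.
model = torch.nn.Linear(784, 10)
x = torch.rand(1, 784)
label = torch.tensor([3])
x_adv = fgsm_attack(model, x, label)
```

That a tiny, human-imperceptible perturbation can flip a model's prediction is precisely the kind of failure for which we lack principled tolerance estimates.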
So far the best approach we have is to guess-and-check: train a model, and then subject it to a battery of tests to determine its capabilities and limitations. But this approach gives little theoretical basis for how to improve systems. And both the training and testing of models are increasingly expensive and labor-intensive (with the cost of foundation model training now rivaling that of the construction o...