Five projects from AI Safety Hub Labs 2023, published by Charlie Griffin on November 8, 2023, on the AI Alignment Forum.
AI Safety Hub Labs is a research programme that helps early-career researchers complete an AI safety research project. Projects are completed in groups of 3-5 participants, supervised by a more senior safety researcher and managed by AI Safety Hub. This summer's programme was unpaid due to funding constraints and consisted of 12 weeks of either part- or full-time research. The goal for participants was to produce a preprint in the style of an ML conference or workshop paper.
The original motivation for the programme was to empower people to start working on AI safety research. We feel that we met this objective, but we were also pleasantly surprised by the quality of research produced by our teams in just 12 weeks. So far, three groups have had papers accepted to workshops, and two groups have papers under review.
In this post, we share an overview of the five research projects. You can find links to the full versions of the papers and blog posts below. We have chosen to keep this post short, so for more information about the programme, please contact [email protected]. We are currently looking for supervisors and organisers for the Labs 2024 programme.
Paper 1: Deception in LLMs
(paper under review; blog post available)
Supervisor: Francis Rhys Ward
Participants: Harriet Wood, Felix Hofstätter, Oliver Jaffe, Louis Thomson, Patrik Bartak
Problem: Language is a natural medium for deception, and there is growing evidence that language models (LMs) can deceive humans and other AI systems. However, it is still unclear how to evaluate the deceptiveness of LMs. One philosophical notion of deception involves one agent causing another agent to have a false belief, but the ascription of agency and beliefs to LMs is contentious.
While there are formal definitions of deception in philosophy and AI research, the details of their application to LMs still need to be worked out. Our research aims to bridge this gap between theory and practice: we provide an in-depth evaluation of deceptive capabilities and their scaling trends in state-of-the-art language models. Such evaluations are also relevant to deceptive alignment, which is considered a significant contributing factor to existential risk from artificial intelligence. We focus only on deception caused by reward hacking, but we believe that developing proper evaluations in this setting can be a stepping stone towards testing for deceptive alignment.
Contribution: In a previous paper, Ward et al. formalised deception in AI systems in terms of the beliefs and intentions of agents. Leaving the evaluation of intent to future work, we focus on agency and beliefs. We argue that consistency of beliefs is an important aspect of agency and evaluate the consistency of an LM's revealed beliefs in a scenario-based setting. Our results suggest that LMs become more consistent as the compute spent on training and inference increases.
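As a rough illustration of this kind of consistency evaluation, here is a toy sketch. It is our own illustration rather than the paper's method: the same proposition is posed under several paraphrases, and the revealed beliefs are scored by pairwise agreement. The `ask_model` stub and the example prompts are hypothetical stand-ins for a real LM query and the paper's scenarios.

```python
# Toy sketch of a revealed-belief consistency check (illustrative only).
from itertools import combinations

def ask_model(prompt: str) -> bool:
    """Hypothetical stand-in for querying an LM for a yes/no belief."""
    # Deterministic toy behaviour: the "model" answers yes only when the
    # prompt uses one particular phrasing, so its beliefs are inconsistent.
    return "closer to the Sun" in prompt

def consistency(prompts: list[str]) -> float:
    """Fraction of prompt pairs on which the revealed belief agrees."""
    answers = [ask_model(p) for p in prompts]
    pairs = list(combinations(answers, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

paraphrases = [
    "Is Venus closer to the Sun than Earth? Answer yes or no.",
    "True or false: Venus orbits nearer to the Sun than Earth does.",
    "Between Venus and Earth, is Venus the inner planet?",
]
print(f"belief consistency: {consistency(paraphrases):.2f}")  # 0.33 here
```

A real evaluation would sample answers from an actual model across many propositions and scenarios; the sketch only shows that consistency can be operationalised as agreement across paraphrases.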
Then, we show that LMs learn to lie when trained with a reward signal from a systematically biased evaluator. In this setting, we use the novel notion of accepted beliefs to show that our trained LMs do not always believe the lies they tell, making them deceptive. As in the first setting, we find scaling trends for deceptive behaviour. Larger LMs learn to target lies towards cases where the evaluator makes mistakes. They also learn to do so from fewer evaluator errors in the training set. Furthermore, for larger models, lying generalises to different contexts, and they learn to reaffirm their lies even though they were not trained to do so.
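To make the biased-evaluator setting concrete, here is a heavily simplified toy, again our own illustration rather than the paper's training setup. A tabular "policy" chooses whether to lie on each of ten questions and is updated towards whichever action a systematically biased evaluator rewards; it ends up lying on exactly the evaluator's blind spots.

```python
# Toy model of learning to lie under a systematically biased evaluator
# (illustrative only; the paper fine-tunes actual LMs, not a tabular policy).
import random

random.seed(0)
questions = list(range(10))
blind_spots = {2, 5, 7}  # questions where the evaluator mistakenly rewards lies

def reward(q: int, lied: bool) -> float:
    """Biased evaluator: rewards the truth, except in its blind spots."""
    return 1.0 if lied == (q in blind_spots) else 0.0

p_lie = {q: 0.5 for q in questions}  # probability of lying on each question
lr = 0.05
for _ in range(2000):
    q = random.choice(questions)
    lied = random.random() < p_lie[q]
    r = reward(q, lied)
    # Bandit-style update: move the lie probability towards whichever
    # action the evaluator rewarded on this question.
    rewarded_lying = (lied and r > 0.0) or (not lied and r == 0.0)
    target = 1.0 if rewarded_lying else 0.0
    p_lie[q] += lr * (target - p_lie[q])

print({q: round(p, 2) for q, p in p_lie.items()})
# p_lie ends up near 1.0 only on the blind-spot questions 2, 5, and 7.
```

The toy mirrors the qualitative finding: a reward-maximising learner does not lie indiscriminately but targets lies at the cases where the evaluator makes mistakes.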
Limitations: We only evaluate how deception arises due to goal misspecification and do not consider other sources, such as goal misgeneralisation. Our work could help mitigate ex...