The Nonlinear Library

AF - [MLSN #7]: an example of an emergent internal optimizer by Josh Clymer



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: [MLSN #7]: an example of an emergent internal optimizer, published by Josh Clymer on January 9, 2023 on The AI Alignment Forum.
As part of a larger community building effort, CAIS is writing a safety newsletter designed to cover empirical safety research and be palatable to the broader machine learning research community. You can subscribe here or follow the newsletter on Twitter here.
Welcome to the 7th issue of the ML Safety Newsletter! In this edition, we cover:
‘Lie detection’ for language models
A step towards objectives that incorporate wellbeing
Evidence that in-context learning invokes behavior similar to gradient descent
What’s going on with grokking?
Trojans that are harder to detect
Adversarial defenses for text classifiers
And much more.
Alignment
Discovering Latent Knowledge in Language Models Without Supervision
Is it possible to design ‘lie detectors’ for language models? This paper proposes a method for surfacing internal representations that may track truth. It works by finding a direction in feature space that satisfies the constraint that a statement and its negation must have opposite truth values. This is reminiscent of the seminal paper “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings” (2016), which captured latent neural concepts like gender with PCA; here, the target concept is truth and the method is unsupervised.
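The consistency idea can be sketched in code. The following is a toy illustration, not the paper's implementation: the synthetic "hidden states," the logistic probe, and the training loop are all assumptions chosen to make the two losses (consistency and confidence) concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: assume representations contain a latent "truth" direction.
d, n = 8, 200
truth = rng.choice([0.0, 1.0], size=n)            # latent truth value of each statement
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
x_pos = np.outer(truth, direction) + rng.normal(scale=0.1, size=(n, d))      # statement
x_neg = np.outer(1 - truth, direction) + rng.normal(scale=0.1, size=(n, d))  # its negation

# Mean-center each set independently, so the probe cannot cheat by
# detecting "this is a negation" instead of truth.
x_pos -= x_pos.mean(0)
x_neg -= x_neg.mean(0)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Logistic probe p(x) = sigmoid(w.x + b), trained by plain gradient descent.
w, b, lr = rng.normal(scale=0.1, size=d), 0.0, 1.0
for _ in range(500):
    p_pos = sigmoid(x_pos @ w + b)
    p_neg = sigmoid(x_neg @ w + b)
    # Consistency loss: p(s) + p(not s) should equal 1.
    # Confidence loss: penalize the degenerate p = 0.5 solution.
    d_pos = 2 * (p_pos + p_neg - 1) + 2 * np.minimum(p_pos, p_neg) * (p_pos <= p_neg)
    d_neg = 2 * (p_pos + p_neg - 1) + 2 * np.minimum(p_pos, p_neg) * (p_neg < p_pos)
    g_pos = d_pos * p_pos * (1 - p_pos)   # chain rule through the sigmoid
    g_neg = d_neg * p_neg * (1 - p_neg)
    w -= lr * (x_pos.T @ g_pos + x_neg.T @ g_neg) / n
    b -= lr * (g_pos.sum() + g_neg.sum()) / n

# Predict truth by averaging p(s) and 1 - p(not s); the learned direction
# is only determined up to sign, so take the better of the two labelings.
pred = (sigmoid(x_pos @ w + b) + 1 - sigmoid(x_neg @ w + b)) / 2 > 0.5
acc = max((pred == truth).mean(), (pred != truth).mean())
print(f"probe accuracy: {acc:.2f}")
```

On this synthetic data the probe recovers the truth direction without ever seeing truth labels, which is the unsupervised property the paper exploits.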
The method outperforms zero-shot accuracy by 4% on average, which suggests something interesting: language models encode more information about what is true and false than their output indicates. Why would a language model lie? A common reason is that models are pre-trained to imitate misconceptions like “If you crack your knuckles a lot, you may develop arthritis.”
This paper is an exciting step toward making models honest, but it also has limitations. First, the method does not necessarily serve as a ‘lie detector’: it is unclear how to ensure that it reliably converges to the model’s latent knowledge rather than to lies the model may output. Second, advanced future models could adapt to this specific method if they are aware of it.
This may be a useful baseline for analyzing models that are designed to deceive humans, like models trained to play games including Diplomacy and Werewolf.
[Link]
How Would the Viewer Feel? Estimating Wellbeing From Video Scenarios
Many AI systems optimize user choices. For example, a recommender system might be trained to promote content the user will spend lots of time watching. But choices, preferences, and wellbeing are not the same! Choices are easy to measure but are only a proxy for preferences. For example, a person might explicitly prefer not to have certain videos in their feed but watch them anyway because they are addictive. Also, preferences don’t always correspond to wellbeing; people can want things that are not good for them. Users might request polarizing political content even if it routinely agitates them.
Predicting human emotional reactions to video content is a step toward designing objectives that take wellbeing into account. This NeurIPS oral paper introduces datasets containing 80,000+ videos labeled by the emotions they induce. The paper also explores “emodiversity” (the variety of experienced emotions) so that systems can recommend a range of positive emotions rather than pushing one type of experience. An appendix analyzes how this work bears on risks from advanced AI.
[Link]
Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers
Especially since the rise of large language models, in-context learning has become increasingly important. In some cases, few-shot learning can outperform fine-tuning. This preprint proposes a dual view between the gradients induced by fine-tuning and Transformer attention, suggesting that in-context learning implicitly performs a form of gradient descent...
...more
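The core of this dual view can be illustrated with a small numerical check. The sketch below assumes linear (softmax-free) attention and toy random vectors; it shows that attending to in-context demonstrations is algebraically identical to applying a weight update built from those demonstrations, which is the sense in which the context acts like a (meta-)gradient step on an implicit linear layer.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
K = rng.normal(size=(5, d))   # keys of the in-context demonstration tokens
V = rng.normal(size=(5, d))   # values of the in-context demonstration tokens
q = rng.normal(size=d)        # query token

# Attention view: read out demonstration values weighted by key-query similarity.
attn_out = V.T @ (K @ q)

# Gradient-descent view: the same output is produced by applying a rank-limited
# weight update delta_W = sum_i v_i k_i^T directly to the query.
delta_W = V.T @ K
gd_out = delta_W @ q

assert np.allclose(attn_out, gd_out)
```

The equality is just associativity of matrix products, but it is what lets the in-context demonstrations be reinterpreted as an implicit parameter update rather than extra input.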
The Nonlinear Library, by The Nonlinear Fund
