The Nonlinear Library

AF - [MLSN #7]: an example of an emergent internal optimizer by Josh Clymer



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: [MLSN #7]: an example of an emergent internal optimizer, published by Josh Clymer on January 9, 2023 on The AI Alignment Forum.
As part of a larger community building effort, CAIS is writing a safety newsletter designed to cover empirical safety research and be palatable to the broader machine learning research community. You can subscribe here or follow the newsletter on Twitter here.
Welcome to the 7th issue of the ML Safety Newsletter! In this edition, we cover:
‘Lie detection’ for language models
A step towards objectives that incorporate wellbeing
Evidence that in-context learning invokes behavior similar to gradient descent
What’s going on with grokking?
Trojans that are harder to detect
Adversarial defenses for text classifiers
And much more.
Alignment
Discovering Latent Knowledge in Language Models Without Supervision
Is it possible to design ‘lie detectors’ for language models? This paper proposes a method for surfacing internal representations that may track truth. It works by finding a direction in feature space that satisfies the constraint that a statement and its negation must have opposite truth values. This is reminiscent of the seminal paper “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings” (2016), which captured latent neural concepts like gender with PCA; here, the target concept is truth and the method is unsupervised.
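The consistency idea can be sketched in code. The following is a toy illustration, not the paper's implementation: the synthetic "hidden states," the logistic probe, and the training loop are all assumptions chosen to make the two losses (consistency and confidence) concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: assume representations contain a latent "truth" direction.
d, n = 8, 200
truth = rng.choice([0.0, 1.0], size=n)            # latent truth value of each statement
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
x_pos = np.outer(truth, direction) + rng.normal(scale=0.1, size=(n, d))      # statement
x_neg = np.outer(1 - truth, direction) + rng.normal(scale=0.1, size=(n, d))  # its negation

# Mean-center each set independently, so the probe cannot cheat by
# detecting "this is a negation" instead of truth.
x_pos -= x_pos.mean(0)
x_neg -= x_neg.mean(0)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Logistic probe p(x) = sigmoid(w.x + b), trained by plain gradient descent.
w, b, lr = rng.normal(scale=0.1, size=d), 0.0, 1.0
for _ in range(500):
    p_pos = sigmoid(x_pos @ w + b)
    p_neg = sigmoid(x_neg @ w + b)
    # Consistency loss: p(s) + p(not s) should equal 1.
    # Confidence loss: penalize the degenerate p = 0.5 solution.
    d_pos = 2 * (p_pos + p_neg - 1) + 2 * np.minimum(p_pos, p_neg) * (p_pos <= p_neg)
    d_neg = 2 * (p_pos + p_neg - 1) + 2 * np.minimum(p_pos, p_neg) * (p_neg < p_pos)
    g_pos = d_pos * p_pos * (1 - p_pos)   # chain rule through the sigmoid
    g_neg = d_neg * p_neg * (1 - p_neg)
    w -= lr * (x_pos.T @ g_pos + x_neg.T @ g_neg) / n
    b -= lr * (g_pos.sum() + g_neg.sum()) / n

# Predict truth by averaging p(s) and 1 - p(not s); the learned direction
# is only determined up to sign, so take the better of the two labelings.
pred = (sigmoid(x_pos @ w + b) + 1 - sigmoid(x_neg @ w + b)) / 2 > 0.5
acc = max((pred == truth).mean(), (pred != truth).mean())
print(f"probe accuracy: {acc:.2f}")
```

On this synthetic data the probe recovers the truth direction without ever seeing truth labels, which is the unsupervised property the paper exploits.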
The method outperforms zero-shot accuracy by 4% on average, which suggests something interesting: language models encode more information about what is true and false than their output indicates. Why would a language model lie? A common reason is that models are pre-trained to imitate misconceptions like “If you crack your knuckles a lot, you may develop arthritis.”
This paper is an exciting step toward making models honest, but it also has limitations. First, the method does not necessarily serve as a ‘lie detector’: it is unclear how to ensure that it reliably converges to the model’s latent knowledge rather than to lies the model may output. Second, advanced future models could adapt to this specific method if they are aware of it.
This may be a useful baseline for analyzing models that are designed to deceive humans, like models trained to play games including Diplomacy and Werewolf.
[Link]
How Would the Viewer Feel? Estimating Wellbeing From Video Scenarios
Many AI systems optimize user choices. For example, a recommender system might be trained to promote content the user will spend lots of time watching. But choices, preferences, and wellbeing are not the same! Choices are easy to measure but are only a proxy for preferences. For example, a person might explicitly prefer not to have certain videos in their feed but watch them anyway because they are addictive. Also, preferences don’t always correspond to wellbeing; people can want things that are not good for them. Users might request polarizing political content even if it routinely agitates them.
Predicting human emotional reactions to video content is a step toward designing objectives that take wellbeing into account. This NeurIPS oral paper introduces datasets containing 80,000+ videos labeled by the emotions they induce. The paper also explores “emodiversity” (the variety of experienced emotions) so that systems can recommend a range of positive emotions rather than pushing one type of experience. An appendix analyzes how this work bears on risks from advanced AI.
[Link]
Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers
Especially since the rise of large language models, in-context learning has become increasingly important. In some cases, few-shot learning can outperform fine-tuning. This preprint proposes a dual view between the gradients induced by fine-tuning and Transformer attention, suggesting that in-context learning implicitly performs a form of gradient descent...
...more
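The core of this dual view can be illustrated with a small numerical check. The sketch below assumes linear (softmax-free) attention and toy random vectors; it shows that attending to in-context demonstrations is algebraically identical to applying a weight update built from those demonstrations, which is the sense in which the context acts like a (meta-)gradient step on an implicit linear layer.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
K = rng.normal(size=(5, d))   # keys of the in-context demonstration tokens
V = rng.normal(size=(5, d))   # values of the in-context demonstration tokens
q = rng.normal(size=d)        # query token

# Attention view: read out demonstration values weighted by key-query similarity.
attn_out = V.T @ (K @ q)

# Gradient-descent view: the same output is produced by applying a rank-limited
# weight update delta_W = sum_i v_i k_i^T directly to the query.
delta_W = V.T @ K
gd_out = delta_W @ q

assert np.allclose(attn_out, gd_out)
```

The equality is just associativity of matrix products, but it is what lets the in-context demonstrations be reinterpreted as an implicit parameter update rather than extra input.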
The Nonlinear Library, by The Nonlinear Fund
