January 12, 2023

"How 'Discovering Latent Knowledge in Language Models Without Supervision' Fits Into a Broader Alignment Scheme" by Collin

33 minutes

---
client: lesswrong
project_id: curated
feed_id: ai, ai_safety, ai_safety__technical
narrator: pw
qa: km
narrator_time: 2h15m
qa_time: 0h35m
---
A few collaborators and I recently released a new paper: Discovering Latent Knowledge in Language Models Without Supervision. For a quick summary of our paper, you can check out this Twitter thread.

In this post I will describe how I think the results and methods in our paper fit into a broader scalable alignment agenda. Unlike the paper, this post is explicitly aimed at an alignment audience and is mainly conceptual rather than empirical.

Tl;dr: unsupervised methods are more scalable than supervised methods, deep learning has special structure that we can exploit for alignment, and we may be able to recover superhuman beliefs from deep learning representations in a totally unsupervised way.

Original article:
https://www.lesswrong.com/posts/L4anhrxjv8j2yRKKp/how-discovering-latent-knowledge-in-language-models-without

Narrated for LessWrong by TYPE III AUDIO.
