
Epistemic status: My coauthor and I are both noobs in this field. Expect errors and conceptual flaws.
tl;dr
For a four-day capstone project in the ARENA program, my partner and I replicated the MELBO LessWrong article using Llama-3.2-1b-Instruct.
Intro
I've spent the last month at the ARENA program developing some basic technical skills to match my long-standing layman's interest in AI and AI X-risk. ARENA's final week is a capstone project that asks us to engage with current literature and do something vaguely novel with it over the span of about a week.
This being a weeklong project, we needed to scope it very tightly. So my partner and I chose to replicate MELBO on a different LLM! The project of "finding, in an unsupervised way, plausible out-of-distribution model behavior" seemed interesting for its own sake and also possibly useful for the task [...]
---
Outline:
(00:21) tl;dr
(00:35) Intro
(01:32) Summary
(03:00) R hyperparameter sweep
(06:44) Concept-specific steering vectors
(07:02) Red teaming
(11:11) The Negation Non-Refusal
(11:56) Non-English language vectors
(13:06) Mental health/suicide prevention
(15:08) Github repo
(15:17) Possible Followups
The original text contained 1 image which was described by AI.
---
First published:
Source:
Narrated by TYPE III AUDIO.