


Epistemic status: My coauthor and I are both noobs in this field. Expect errors and conceptual flaws.
tl;dr
For a four-day capstone project for the ARENA program, my partner and I replicated the MELBO LessWrong article using Llama-3.2-1b-Instruct.
Intro
I've spent the last month at the ARENA program developing some basic technical skills to match my long-standing layman's interest in AI and AI X-risk. ARENA's final week is a capstone project that asks us to engage with current literature and do something vaguely novel with it over the span of about a week.
This being a weeklong project, we needed to scope it very tightly. So my partner and I chose to replicate MELBO on a different LLM! The project of "finding, in an unsupervised way, plausible out-of-distribution model behavior" seemed interesting for its own sake and also possibly useful for the task [...]
---
Outline:
(00:21) tl;dr
(00:35) Intro
(01:32) Summary
(03:00) R hyperparameter sweep
(06:44) Concept-specific steering vectors
(07:02) Red teaming
(11:11) The Negation Non-Refusal
(11:56) Non-English language vectors
(13:06) Mental health/suicide prevention
(15:08) Github repo
(15:17) Possible Followups
The original text contained 1 image which was described by AI.
---
First published:
Source:
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
By LessWrong
