Mechanical Dreams

Shaping capabilities with token-level data filtering



In this episode:
• Welcome and the Post-hoc Problem: Linda introduces the paper and the hosts discuss why post-hoc unlearning methods fall short against adversarial attacks.
• Token vs. Document Filtering: An exploration of why token-level filtering acts as a scalpel compared to the blunt instrument of document filtering.
• Scaling Labels with SAEs: The hosts discuss how the authors use Sparse Autoencoders to label a subset of data and distill that into a highly efficient biLM classifier.
• Scaling Laws and Adversarial Robustness: Linda reveals the 7000x compute slowdown on the forget domain and how it compares to state-of-the-art unlearning methods.
• The Alignment Twist: A surprising finding that token-level filtering actually improves a model's ability to be trained to refuse dangerous prompts.

Mechanical Dreams, by Mechanical Dirk