April 05, 2026

Shaping capabilities with token-level data filtering

23 minutes

In this episode:
• Welcome and the Post-hoc Problem: Linda introduces the paper and the hosts discuss why post-hoc unlearning methods fall short against adversarial attacks.
• Token vs. Document Filtering: An exploration of why token-level filtering acts as a scalpel compared to the blunt instrument of document filtering.
• Scaling Labels with SAEs: The hosts discuss how the authors use Sparse Autoencoders to label a subset of data and distill that into a highly efficient biLM classifier.
• Scaling Laws and Adversarial Robustness: Linda reveals the massive 7000x compute slowdown on the forget domain and how it compares to state-of-the-art unlearning.
• The Alignment Twist: A surprising finding that token-level filtering actually improves a model's ability to be trained to refuse dangerous prompts.

By Mechanical Dirk
