TYPE III AUDIO (All episodes)

Interpretability in the wild and other papers


---
client: t3a
feed_id: ai_safety_abstracts
narrator: ai
---

This episode covers 3 abstracts:

  1. Active reward learning from multiple teachers - Peter Barnett et al. 
  2. Conditioning Predictive Models: Risks and Strategies - Hubinger et al.
  3. Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small - Kevin Wang et al.


By TYPE III AUDIO