LessWrong (30+ Karma)

“How Can Interpretability Researchers Help AGI Go Well?” by Neel Nanda


Executive Summary

  • Over the past year, the Google DeepMind mechanistic interpretability team has pivoted to a pragmatic approach to interpretability, as detailed in our accompanying post [1], and we are excited for more in the field to embrace pragmatism! In brief, we think that:
    • It is crucial to have empirical feedback on your ultimate goal with good proxy tasks [2].
    • We do not need near-complete understanding to have significant impact.
    • We can perform good focused projects by starting with a theory of change, and good exploratory projects by starting with a robustly useful setting.
  • But that's pretty abstract. So how can interpretability help AGI go well? A few theories of change stand out to us:
    • Science of Misalignment: If a model takes a harmful action, we want to be able to rigorously determine whether it was “scheming” or just “confused” [3].
    • Empowering Other Areas Of Safety: Interpretability is not a silver bullet that will solve safety by itself, but it can significantly help other areas by unblocking things or addressing weak points where appropriate, e.g. suppressing eval awareness, or interpreting what safety techniques taught a model.
    • [...]

---

Outline:

(00:11) Executive Summary

(02:57) Theories Of Change

(04:25) Science Of Misalignment

(06:59) Empowering Other Areas Of AGI Safety

(07:17) Evaluation Awareness

(07:53) Better Feedback On Safety Research

(08:11) Conceptual Progress On Model Psychology

(08:44) Maintaining Monitorability Of Chain Of Thought

(09:20) Preventing Egregiously Misaligned Actions

(11:20) Directly Helping Align Models

(13:30) Research Areas Directed Towards Theories Of Change

(13:53) Model Biology

(15:48) Helping Direct Model Training

(15:55) Monitoring

(17:10) Research Areas About Robustly Useful Settings

(18:00) Reasoning Model Interpretability

(22:42) Automating Interpretability

(23:51) Basic Science Of AI Psychology

(24:08) Finding Good Proxy Tasks

(25:05) Discovering Unusual Behaviours

(26:12) Data-Centric Interpretability

(27:16) Model Diffing

(28:05) Applied Interpretability

(29:01) Appendix: Motivating Example For Why Reasoning Model Interp Breaks Standard Techniques

The original text contained 16 footnotes which were omitted from this narration.

---

First published:

December 1st, 2025

Source:

https://www.lesswrong.com/posts/MnkeepcGirnJn736j/how-can-interpretability-researchers-help-agi-go-well

---

Narrated by TYPE III AUDIO.
