
Audio note: this article contains 38 uses of latex notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.
Paper | GitHub | Demo Notebook
This post is about our recent paper Learning to Interpret Weight Differences in Language Models (Goel et al., Oct. 2025). We introduce a method for training a LoRA adapter that gives a finetuned model the ability to accurately describe the effects of its finetuning.
Figure 1: A demonstration of our method on Qwen3-8B. With the adapter applied, a model is able to answer questions about its finetuning changes. Try it yourself here.
WeightDiffQA
Our paper introduces and attempts to solve a task we call WeightDiffQA[1]:
Given a language model _M_, a weight diff _delta_, and a natural language question _q_ about _delta_, output a correct natural language answer to _q_.
Here, a "weight [...]
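As a rough illustration of the objects in the task statement, here is a minimal Python sketch. The explanation of "weight diff" is cut off in this description, so the sketch assumes it means the per-parameter difference between the finetuned weights and the base weights of _M_. The names `Weights`, `weight_diff`, and `apply_diff` are hypothetical and do not come from the paper or its code.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def _check_compatible(a: Weights, b: Weights) -> None:
    """Raise if the two weight sets do not share names and shapes."""
    if a.keys() != b.keys():
        raise ValueError("parameter names differ")
    for name in a:
        if len(a[name]) != len(b[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")


def weight_diff(base: Weights, finetuned: Weights) -> Weights:
    """Assumed definition: delta = finetuned - base, per parameter."""
    _check_compatible(base, finetuned)
    return {
        name: [f - b for b, f in zip(base[name], finetuned[name])]
        for name in base
    }


def apply_diff(base: Weights, delta: Weights) -> Weights:
    """Recover the finetuned weights as base + delta."""
    _check_compatible(base, delta)
    return {
        name: [b + d for b, d in zip(base[name], delta[name])]
        for name in base
    }


base = {"w": [1.0, 2.0]}
finetuned = {"w": [1.5, 1.0]}
delta = weight_diff(base, finetuned)
```

Under this assumption, WeightDiffQA asks for a system that takes the base model, `delta`, and a question _q_, and returns an answer in natural language. The sketch covers only how `delta` relates to the two sets of weights, not how the question is answered.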
---
Outline:
(00:57) WeightDiffQA
(03:17) Diff Interpretation Tuning
(05:09) Eval #1: Reporting hidden behaviors
(07:17) Eval #2: Summarizing finetuned knowledge
(08:26) Limitations
(09:50) Takeaways
The original text contained 8 footnotes which were omitted from this narration.
---
---
Narrated by TYPE III AUDIO.
---
By LessWrong
