
Sign up to save your podcasts
Or


Audio note: this article contains 38 uses of latex notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.
Paper | Github | Demo Notebook
This post is about our recent paper Learning to Interpret Weight Differences in Language Models (Goel et al. Oct. 2025). We introduce a method for training a LoRA adapter that gives a finetuned model the ability to accurately describe the effects of its finetuning.
Figure 1: A demonstration of our method on Qwen3-8B. With the adapter applied, a model is able to answer questions about its finetuning changes. Try it yourself here.WeightDiffQA
Our paper introduces and attempts to solve a task we call WeightDiffQA[1]:
Given a language model _M_, a weight diff _delta_, and a natural language question _q_ about _delta_, output a correct natural language answer to _q_.
Here, a "weight [...]
---
Outline:
(00:57) WeightDiffQA
(03:17) Diff Interpretation Tuning
(05:09) Eval #1: Reporting hidden behaviors
(07:17) Eval #2: Summarizing finetuned knowledge
(08:26) Limitations
(09:50) Takeaways
The original text contained 8 footnotes which were omitted from this narration.
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
By LessWrong
Audio note: this article contains 38 uses of latex notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.
Paper | Github | Demo Notebook
This post is about our recent paper Learning to Interpret Weight Differences in Language Models (Goel et al. Oct. 2025). We introduce a method for training a LoRA adapter that gives a finetuned model the ability to accurately describe the effects of its finetuning.
Figure 1: A demonstration of our method on Qwen3-8B. With the adapter applied, a model is able to answer questions about its finetuning changes. Try it yourself here.WeightDiffQA
Our paper introduces and attempts to solve a task we call WeightDiffQA[1]:
Given a language model _M_, a weight diff _delta_, and a natural language question _q_ about _delta_, output a correct natural language answer to _q_.
Here, a "weight [...]
---
Outline:
(00:57) WeightDiffQA
(03:17) Diff Interpretation Tuning
(05:09) Eval #1: Reporting hidden behaviors
(07:17) Eval #2: Summarizing finetuned knowledge
(08:26) Limitations
(09:50) Takeaways
The original text contained 8 footnotes which were omitted from this narration.
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

112,214 Listeners

131 Listeners

7,239 Listeners

559 Listeners

16,276 Listeners

4 Listeners

14 Listeners

2 Listeners