
There's a lot of talk about “algorithmic progress” in LLMs, especially in the context of exponentially-improving algorithmic efficiency. For example:
It's nice to see three independent sources reach almost exactly the same conclusion—halving times of 8 months, 6 months, and 7½ months respectively. Surely a sign that the conclusion is solid!
…Haha, just kidding! I’ll argue that these three bullet points are hiding three totally different stories. The first two bullets are about training efficiency, and I’ll argue that both are deeply misleading (each for a different reason!). The third is about inference efficiency, which I think is right, and mostly explained by distillation of ever-better frontier models into their “mini” cousins.
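As a quick aside (my own arithmetic, not from the post): a halving time converts to an annual improvement factor via 2^(12/months), which is how a 6-month halving time and "4x per year" end up being the same claim. A minimal Python sketch:

# Minimal sketch (my own illustration, not from the post): convert a halving
# time in months into an approximate annual improvement factor.
for months in (8, 6, 7.5):
    factor = 2 ** (12 / months)
    print(f"{months}-month halving time ~ {factor:.1f}x per year")
# 8 months   ~ 2.8x/year  (the Epoch training-efficiency estimate)
# 6 months   =  4x/year   (the figure attributed to Dario)
# 7.5 months ~ 3.0x/year  (the inference-efficiency estimate)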
source

Tl;dr / outline
---
Outline:
(01:48) Tl;dr / outline
(04:00) Status of this post
(04:12) 1. The big picture of LLM algorithmic progress, as I understand it right now
(04:19) 1.1. Stereotypical algorithmic efficiency improvements: there's the Transformer itself, and ... well, actually, not much else to speak of
(06:36) 1.2. Optimizations: Let's say up to 20×, but there's a ceiling
(07:47) 1.3. Data-related improvements
(09:40) 1.4. Algorithmic changes that are not really quantifiable as efficiency
(10:25) 2. Explaining away the two training-efficiency exponential claims
(10:46) 2.1. The Epoch 8-month halving time claim is a weird artifact of their methodology
(12:33) 2.2. The Dario 4x/year claim is I think just confused
(17:28) 3. Sanity-check: nanochat
(19:43) 4. Optional bonus section: why does this matter?
The original text contained 5 footnotes which were omitted from this narration.
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
