Epistemic status: This is a quick analysis that might have major mistakes. I currently think there is something real and important here. I'm sharing it to elicit feedback: to update others insofar as an update is in order, and to learn that I'm wrong insofar as that's the case.
Summary
The canonical paper on algorithmic progress is Ho et al. (2024), who find that, historically, the pre-training compute needed to reach a given level of AI capabilities decreases by about 3× each year. Their data covers 2012-2023 and is focused on pre-training.
In this post I look at AI models from 2023-2025 and find that, based on what I think is the most intuitive analysis, catch-up algorithmic progress (including post-training) over this period is something like 16×–60× each year.
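For intuition, a compute-efficiency factor of $f$ per year corresponds to a halving time of $\ln 2 / \ln f$ years. The conversion below is my own arithmetic on the headline figures, not a result from the post's data:

$$
t_{1/2} = \frac{\ln 2}{\ln f} \approx
\begin{cases}
7.6 \text{ months} & f = 3 \text{ (Ho et al.)} \\
3.0 \text{ months} & f = 16 \\
2.0 \text{ months} & f = 60
\end{cases}
$$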
This intuitive analysis involves drawing a best-fit line through models on the frontier of training-compute efficiency over time, i.e., those that, as of their release, used the least training compute of any model so far to reach or exceed a given capability level. I combine Epoch AI's estimates of training compute with model capability scores from Artificial Analysis's Intelligence Index (AAII). Each capability level thus yields a slope from its fit line, and these [...]
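To make the method concrete, here is a minimal Python sketch of the frontier-and-fit procedure. The data points, the threshold, and the function name are hypothetical stand-ins, not the post's actual inputs (which are Epoch AI's compute estimates and AAII scores):

```python
from datetime import date
import numpy as np

# (release date, training compute in FLOP, AAII-style capability score).
# All values below are made up for illustration.
models = [
    (date(2023, 3, 1), 2e25, 40),
    (date(2023, 11, 1), 6e24, 42),
    (date(2024, 6, 1), 1e24, 45),
    (date(2025, 2, 1), 2e23, 47),
]

def annual_efficiency_gain(models, threshold):
    """Fit log10(compute) vs. time over the compute-efficiency frontier
    at `threshold`; return the implied x-per-year compute reduction."""
    # Models that reach or exceed the capability threshold, by release date.
    capable = sorted((m for m in models if m[2] >= threshold),
                     key=lambda m: m[0])
    # Frontier: models that used less compute than every earlier capable model.
    frontier, best = [], float("inf")
    for d, flop, _ in capable:
        if flop < best:
            frontier.append((d, flop))
            best = flop
    t = np.array([(d - frontier[0][0]).days / 365.25 for d, _ in frontier])
    y = np.log10([flop for _, flop in frontier])
    slope = np.polyfit(t, y, 1)[0]  # change in log10(FLOP) per year
    return 10 ** (-slope)           # e.g. 3.0 means "3x less compute per year"

print(f"~{annual_efficiency_gain(models, threshold=40):.0f}x per year")
```

The post repeats this fit for each capability level and summarizes the resulting slopes; the sketch above handles only a single threshold.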
---
Outline:
- Summary
- What do I mean by 'algorithmic progress'?
- Methods and Results
  - Sanity check: Qwen2.5-72B vs. Qwen3-30B-A3B
- Discussion
  - How does this compare to the recent analysis in A Rosetta Stone for AI Benchmarks?
  - How does this compare to other previous estimates of algorithmic progress?
  - How should we update on this analysis?
- Appendices
  - Appendix: Filtering by different confidence levels of compute estimates
    - All models
    - Confident compute estimates
  - Appendix: How fast is the cost of AI inference falling?
  - Appendix: Histogram of 1-point buckets
  - Appendix: Qwen2.5 and Qwen3 benchmark performance
  - Appendix: Leave-One-Out analysis
  - Appendix: Limitations
    - Outlier models
    - Lack of early, weak models
    - Post-training compute excluded
    - Inference-time compute excluded
    - Some AAII scores are estimates
    - Comparing old and new models on the same benchmark
---