Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI Forecasting: Two Years In, published by jsteinhardt on August 20, 2023 on LessWrong.
Two years ago, I commissioned forecasts for state-of-the-art performance on several popular ML benchmarks. Forecasters were asked to predict state-of-the-art performance on June 30th of 2022, 2023, 2024, and 2025. While there were four benchmarks total, the two most notable were MATH (a dataset of free-response math contest problems) and MMLU (a dataset of multiple-choice exams from the high school to post-graduate level).
One year ago, I evaluated the first set of forecasts. Forecasters did poorly and underestimated progress, with the true performance lying in the far right tail of their predicted distributions. Anecdotally, experts I talked to (including myself) also underestimated progress. As a result of this, I decided to join the fray and registered my own forecasts for MATH and MMLU last July.
June 30, 2023 has now passed, so we can resolve the forecasts and evaluate my own performance as well as that of other forecasters, including both AI experts and generalist "superforecasters". I'll evaluate the original forecasters that I commissioned through Hypermind, the crowd forecasting platform Metaculus, and participants in the XPT forecasting competition organized by Karger et al. (2023), which was stratified into AI experts and superforecasters.
Overall, here is how I would summarize the results:
Metaculus and I did the best and were both well-calibrated, with the Metaculus crowd forecast doing slightly better than me.
The AI experts from Karger et al. did the next best. They had similar medians to me but were (probably) overconfident in the tails.
The superforecasters from Karger et al. did the next best. They (probably) systematically underpredicted progress.
The forecasters from Hypermind did the worst. They underpredicted progress significantly on MMLU.
Interestingly, this reverses my impression from last year: even though forecasters underpredicted progress then, I thought of experts as underpredicting progress even more. This time, the experts did quite well, better than the generalist forecasters.
What accounts for the difference? Some may be selection effects (experts who try to register forecasts are more likely to be correct). But I'd guess some is also effort: the expert "forecasts" I had in mind last year were from informal hallway conversations, while this year they were formal quantitative predictions with some (small) monetary incentive to be correct. In general, I think we should trust expert predictions more in this setting (relative to their informal statements), and I'm now somewhat more optimistic that experts can give accurate forecasts given a bit of training and the correct incentives.
In the rest of the post, I'll first dive into everyone's forecasts and evaluate each in turn. Then, I'll consider my own forecast in detail, evaluating not just the final answer but the reasoning I used (which was preregistered and can be found here).
My forecasts, and others
As a reminder, forecasts are specified as probability distributions over some (hopefully unambiguously) resolvable future outcome. In this case the outcome was the highest credibly claimed benchmark accuracy by any ML system on the MATH and MMLU benchmarks as of June 30, 2023.
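To make the resolution mechanics concrete, here is a minimal sketch (in Python) of how a forecast over benchmark accuracy can be represented and then resolved against the observed result. The Beta distribution and its parameters below are illustrative assumptions, not my actual forecast; only the 69.6% MATH result is taken from the post.

```python
# Minimal sketch: representing a benchmark forecast as a probability
# distribution and resolving it against the observed outcome.
# The Beta(8, 4) forecast below is a hypothetical stand-in.
from scipy import stats

# A forecast for MATH accuracy on June 30, 2023, encoded as a Beta
# distribution over accuracy in [0, 1] (illustrative parameters).
forecast = stats.beta(a=8, b=4)

# The benchmark result observed on the resolution date
# (69.6% is the actual MATH result cited in the post).
observed_accuracy = 0.696

# The percentile at which the outcome falls under the forecast's CDF.
# Values near 50% suggest a well-centered forecast; values far out in
# either tail mean the forecaster badly over- or under-estimated progress.
percentile = forecast.cdf(observed_accuracy) * 100
print(f"Outcome landed at the {percentile:.0f}th percentile of the forecast")
```

This is the same notion of "percentile" used below when comparing my forecasts to the realized MATH and MMLU results.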
My forecasts from July 17, 2022 are displayed below, both as probability density functions and as cumulative distribution functions, together with the actual results:
[Figure: forecast distributions for MATH and MMLU. Results: MATH 69.6% (Lightman et al., 2023); MMLU 86.4% (GPT-4).]
Orange is my own forecast, while green is the crowd forecast of Metaculus on the same date. For MATH, the true result was at my 41st percentile, while for MMLU it was at my 66th percentile. I slightly overestimated progress on MATH and underestimated MMLU, but both were within my range of e...