Time horizon vs. model release date, using LLM-predicted human work-hours, for 10 successive state-of-the-art models on WeirdML. Error bars show 95% CI from task-level bootstrap. The exponential fit (orange line/band) gives a doubling time of 4.8 months [3.8, 5.8].
Key finding: WeirdML time horizons roughly double every 5 months, from ~24 minutes (GPT-4, June 2023) to ~38 hours (Claude Opus 4.6, February 2026).
| Model | Release | Time horizon (95% CI) |
|---|---|---|
| Claude Opus 4.6 (adaptive) | Feb 2026 | 37.7 h [21.6 h, 62.4 h] |
| GPT-5.2 (xhigh) | Dec 2025 | 30.6 h [18.3 h, 54.4 h] |
| Gemini 3 Pro (high) | Nov 2025 | 22.3 h [14.4 h, 36.2 h] |
| GPT-5 (high) | Aug 2025 | 14.5 h [8.6 h, 24.1 h] |
| o3-pro (high) | Jun 2025 | 11.8 h [7.2 h, 18.9 h] |
| o4-mini (high) | Apr 2025 | 8.4 h [5.8 h, 13.6 h] |
| o1-preview | Sep 2024 | 6.2 h [4.2 h, 10.5 h] |
| Claude 3.5 Sonnet | Jun 2024 | 1.9 h [59 min, 3.5 h] |
| Claude 3 Opus | Mar 2024 | 1.1 h [16 min, 2.3 h] |
| GPT-4 | Jun 2023 | 24 min [4 min, 51 min] |
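As a rough sanity check on the doubling time, here is a minimal sketch that fits a straight line to log2(time horizon) against release month, using only the point estimates from the table. It is an unweighted least-squares fit, not the bootstrap-based trend fit behind the figure, and the month offsets (whole months since June 2023) and the helper name `doubling_time_months` are my own assumptions for illustration.

```python
import math

# (model, whole months since June 2023, time horizon in hours).
# Point estimates copied from the table; month offsets are an assumption
# (release dates are only given to the month).
DATA = [
    ("GPT-4", 0, 24 / 60),
    ("Claude 3 Opus", 9, 1.1),
    ("Claude 3.5 Sonnet", 12, 1.9),
    ("o1-preview", 15, 6.2),
    ("o4-mini (high)", 22, 8.4),
    ("o3-pro (high)", 24, 11.8),
    ("GPT-5 (high)", 26, 14.5),
    ("Gemini 3 Pro (high)", 29, 22.3),
    ("GPT-5.2 (xhigh)", 30, 30.6),
    ("Claude Opus 4.6 (adaptive)", 32, 37.7),
]


def doubling_time_months(points):
    """Unweighted least-squares fit of log2(hours) vs. months.

    The slope is in doublings per month, so its reciprocal is the
    doubling time in months.
    """
    xs = [months for _, months, _ in points]
    ys = [math.log2(hours) for _, _, hours in points]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return 1.0 / slope


doubling = doubling_time_months(DATA)
print(f"Doubling time: {doubling:.1f} months")
```

Because this ignores the per-model uncertainty, it will not reproduce the reported confidence interval, but the point estimate should land close to the roughly five-month doubling time quoted above.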
Inspired by METR's work on AI time horizons (paper), I wanted to do the same analysis for my WeirdML data. WeirdML is my benchmark, supported by METR and included in the Epoch AI benchmarking hub and the Epoch Capabilities Index. It asks LLMs to solve weird and unusual ML tasks (for more details, see the WeirdML page).
Lacking the resources to pay [...]
---
Outline:
(03:23) LLM-predicted human completion times
(04:47) Results calibrated on my completion time predictions
(05:36) Consistency of time-horizons for different thresholds
(09:17) Discussion
(11:42) Implementation details
(11:53) Logistic function fits
(13:38) Task-based bootstrap
(14:04) Trend fit
(14:28) Full prompt for human completion time prediction
---