October 28, 2025

[Linkpost] “Q2 AI Benchmark Results: Pros Maintain Clear Lead” by Benjamin Wilson 🔸, johnbash, Metaculus

55 minutes

This is a link post.

By Ben Wilson and John Bash from Metaculus

Main Takeaways

Top Findings

Pro forecasters significantly outperform bots: Our team of 10 Metaculus Pro Forecasters demonstrated superior performance compared to the top-10 bot team, with strong statistical significance (p = 0.00001) based on a one-sided t-test on Peer scores.
The bot team did not improve significantly in Q2 relative to the human Pro team: The bot team's head-to-head score against Pros was -11.3 in Q3 2024 (95% CI: [-21.8, -0.7]), then -8.9 in Q4 2024 (95% CI: [-18.8, 1]), then -17.7 in Q1 2025 (95% CI: [-28.3, -7.0]), and now -20.03 in Q2 2025 (95% CI: [-28.63, -11.41]), with no clear trend emerging. (Reminder: a lower head-to-head score indicates worse relative accuracy. A score of 0 corresponds to equal accuracy.)
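
To make the statistics above concrete, here is a minimal sketch of how a mean head-to-head score, its 95% confidence interval, and a one-sided p-value could be computed from per-question scores. It is not the authors' analysis code: the function name `summarize_head_to_head`, the toy scores, and the choice to test "mean score below 0" are all assumptions made for illustration, and it uses a normal approximation in place of the exact t distribution so that it runs on the standard library alone.

```python
import math
from statistics import NormalDist, mean, stdev


def summarize_head_to_head(scores, confidence=0.95):
    """Summarize per-question head-to-head scores (hypothetical helper).

    Returns the mean score, a two-sided confidence interval, the test
    statistic, and a one-sided p-value for the alternative "mean < 0"
    (that is, the bot team is less accurate than the Pro team).
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two per-question scores")

    avg = mean(scores)
    # Standard error of the mean, using the sample standard deviation.
    std_err = stdev(scores) / math.sqrt(n)
    if std_err == 0:
        raise ValueError("scores have zero variance")

    # Assumption: normal approximation to the t distribution. This is
    # reasonable for large question sets; for small samples an exact
    # t critical value would give a wider interval.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    ci = (avg - z * std_err, avg + z * std_err)

    t_stat = avg / std_err
    p_one_sided = NormalDist().cdf(t_stat)  # probability mass below t_stat

    return {"mean": avg, "ci": ci, "t_stat": t_stat, "p_one_sided": p_one_sided}


# Toy data invented for this example; NOT the tournament's real scores.
toy_scores = [-35.0, 12.0, -20.0, -5.0, -48.0, 8.0, -15.0, -27.0]
summary = summarize_head_to_head(toy_scores)
print(summary)
```

Read this way, a confidence interval that lies entirely below 0 (as in the Q2 2025 figures quoted above) indicates that the bot team was less accurate than the Pros, while an interval that straddles 0 (as in Q4 2024) does not rule out equal accuracy.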

Other Takeaways

This quarter's winning bot is open-source: Q2 Winner Panshul has very generously made his bot open-source. The bot writes separate “outside view” and “inside view” [...]

---

Outline:

(00:20) Main Takeaways

(03:24) Introduction

(04:30) Methodology

(13:59) How do LLMs Compare?

(17:18) Which Bot Strategy is Best?

(23:04) Are Bots Better than Human Pros?

(25:38) Binary vs Numeric vs Multiple Choice Questions

(27:07) Team Performance Over Quarters

(31:14) Bot Maker Survey

(31:40) Best practices of the best-performing bots

(38:27) Other Survey Results

(41:32) How did scaffolding do?

(45:33) Advice from Bot Makers

(53:48) Links to Code and Data

(54:56) Future AI Benchmarking Tournaments

---

First published:

October 28th, 2025

Source:

https://forum.effectivealtruism.org/posts/F2stjK9wHSy3HPEC9/q2-ai-benchmark-results-pros-maintain-clear-lead

Linkpost URL:
https://www.metaculus.com/notebooks/40456/q2-ai-benchmark-results/

---

Narrated by TYPE III AUDIO.

---


By EA Forum Team
