March 01, 2026

“Introducing and Deprecating WoFBench” by jefftk

5 minutes

We present and formally deprecate WoFBench, a novel test that compares the knowledge of Wings of Fire superfans to frontier AI models. The benchmark showed initial promise as a challenging evaluation, but unfortunately proved to be saturated on creation as AI models and superfans produced output that was, to the extent of our ability to score responses, statistically indistinguishable from entirely correct.

Benchmarks are important tools for tracking the rapid advancements in model capabilities, but they are struggling to keep up with LLM progress: frontier models now consistently achieve high scores on many popular benchmarks, raising questions about their continued ability to differentiate between models.

In response, we introduce WoFBench, an evaluation suite designed to test recall and knowledge synthesis in the domain of Tui T. Sutherland's Wings of Fire universe.

The superfans were identified via a careful search process, in which all members of the lead author's household were asked to complete a self-assessment of their knowledge of the Wings of Fire universe. The assessment consisted of a single question, with the text "do you think you know the Wings of Fire universe better than Gemini?" Two superfans were identified, who we keep [...]

---

First published:

March 1st, 2026

Source:

https://www.lesswrong.com/posts/YshqDtyzgWaJxthTo/introducing-and-deprecating-wofbench

---

Narrated by TYPE III AUDIO.



By LessWrong

