We present and formally deprecate WoFBench, a novel test that compares
the knowledge of Wings of Fire superfans to frontier AI models. The
benchmark showed initial promise as a challenging evaluation, but
unfortunately proved to be saturated on creation as AI models and
superfans produced output that was, to the extent of our ability to
score responses, statistically indistinguishable from entirely
correct.
Benchmarks are important tools for tracking the rapid advancements in
model capabilities, but they are struggling to keep up with LLM
progress: frontier models now consistently achieve high scores on many
popular benchmarks, raising questions about their
continued ability to differentiate between models.
In response, we introduce WoFBench, an evaluation suite designed to
test recall and knowledge synthesis in the domain of Tui
T. Sutherland's Wings of Fire universe.
The superfans were identified via a careful search process, in which
all members of the lead author's household were asked to complete a
self-assessment of their knowledge of the Wings of Fire universe. The
assessment consisted of a single question, with the text "do you think
you know the Wings of Fire universe better than Gemini?" Two
superfans were identified, who we keep [...]
---
https://www.lesswrong.com/posts/YshqDtyzgWaJxthTo/introducing-and-deprecating-wofbench
Narrated by TYPE III AUDIO.