


The paper introduces DR-Arena, a fully automated framework for evaluating Deep Research (DR) agents in dynamic, real-world environments. Traditional static benchmarks suffer from temporal misalignment with evolving facts and from data contamination; to avoid both, DR-Arena constructs Dynamic Information Trees by scraping the live web in real time.
The framework operates through an automated Examiner that probes two core capabilities: Deep reasoning (multi-hop deduction) and Wide coverage (information gathering and aggregation). A key innovation is the Adaptive Evolvement Loop, a controller that dynamically increases task complexity based on an agent's real-time performance until a decisive capability boundary is identified.
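As a rough illustration of the escalation idea, here is a minimal Python sketch. It is not the paper's implementation: the `Task` fields, the `judge` callback, the pairwise "A"/"B"/"tie" verdicts, and the rule of alternating between deepening and widening are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Task:
    """Hypothetical task descriptor (not from the paper)."""
    depth: int  # number of reasoning hops required ("Deep")
    width: int  # number of items to gather and aggregate ("Wide")


def adaptive_loop(
    judge: Callable[[Task], str],
    start: Task = Task(depth=1, width=1),
    max_rounds: int = 10,
) -> Tuple[str, List[Tuple[Task, str]]]:
    """Escalate task complexity until the judge returns a decisive verdict.

    `judge(task)` is assumed to return "A", "B", or "tie".
    Returns the final verdict and the (task, verdict) history.
    """
    task = start
    history: List[Tuple[Task, str]] = []
    for round_index in range(max_rounds):
        verdict = judge(task)
        history.append((task, verdict))
        if verdict != "tie":
            return verdict, history  # capability boundary found
        # Assumed escalation rule: alternate deeper and wider tasks.
        if round_index % 2 == 0:
            task = Task(task.depth + 1, task.width)
        else:
            task = Task(task.depth, task.width + 1)
    return "tie", history


def toy_judge(task: Task) -> str:
    """Toy examiner: agent A copes with 3 hops, agent B with only 2."""
    a_ok = task.depth <= 3
    b_ok = task.depth <= 2
    if a_ok == b_ok:
        return "tie"
    return "A" if a_ok else "B"


verdict, history = adaptive_loop(toy_judge)
print(verdict, history[-1][0])
```

With the toy judge, the two agents tie on the easier tasks and only separate once the loop reaches a three-hop task, which is the kind of boundary the controller is described as searching for.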
In experiments with six state-of-the-art DR agents, DR-Arena's rankings achieve a 0.94 Spearman correlation with human-verified leaderboards such as the LMSYS Search Arena. This high level of alignment suggests that the framework can serve as a scalable and reliable proxy for human adjudication, distinguishing between closely matched models without manual effort.
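For a sense of what a Spearman correlation near 0.94 means with only six ranked items, the sketch below computes the coefficient for two made-up rankings using the standard no-ties formula, rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)). The rankings are purely illustrative and are not the paper's data.

```python
from typing import Sequence


def spearman_no_ties(rank_a: Sequence[int], rank_b: Sequence[int]) -> float:
    """Spearman's rho for two rankings of the same items, assuming no ties."""
    if len(rank_a) != len(rank_b):
        raise ValueError("rankings must have the same length")
    n = len(rank_a)
    if n < 2:
        raise ValueError("need at least two items")
    # Sum of squared rank differences, item by item.
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# Hypothetical rankings of six agents (positions 1..6), for illustration only.
automated_rank = [1, 2, 3, 4, 5, 6]
human_rank = [1, 2, 3, 4, 6, 5]  # identical except one adjacent swap

rho = spearman_no_ties(automated_rank, human_rank)
print(round(rho, 4))  # 0.9429
```

In this made-up case a single swap of two neighbouring positions already gives roughly 0.94, which shows how coarse the statistic is at n = 6; it says nothing about which agents, if any, were ordered differently in the paper.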
By Yun Wu