April 12, 2026

ClawBench for Real-World Online AI Agents

This episode explores ClawBench, a benchmark that tests whether frontier AI agents can reliably complete everyday online tasks on live production websites rather than in simplified sandboxes. The benchmark spans 153 tasks on 144 websites in 15 categories, including travel, shopping, job applications, and office admin. The episode explains why real-world web use is much harder than static benchmarks suggest: agents must cope with cookie banners, dynamic pages, login issues, anti-bot friction, and multi-step form filling. The discussion argues that a strong language model is not automatically a strong agent, because closed-loop browser interaction demands error recovery, state tracking, and precise action selection in messy environments. The episode also examines the tradeoff between realism, safety, and reproducibility, including ClawBench’s submission-blocking safety layer and its agent-based evaluator for scoring complex live-web workflows.

Sources:

1. ClawBench: Can AI Agents Complete Everyday Online Tasks? — Yuxuan Zhang, Yubo Wang, Yipeng Zhu, Penghui Du, Junwen Miao, Xuan Lu, Wendong Xu, Yunzhuo Hao, Songcheng Cai, Xiaochen Wang, Huaisong Zhang, Xian Wu, Yi Lu, Minyi Lei, Kai Zou, Huifeng Yin, Ping Nie, Liang Chen, Dongfu Jiang, Wenhu Chen, Kelsey R. Allen, 2026

http://arxiv.org/abs/2604.08523

2. WebArena: A Realistic Web Environment for Building Autonomous Agents — Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Graham Neubig, and others, 2024

https://scholar.google.com/scholar?q=WebArena%3A+A+Realistic+Web+Environment+for+Building+Autonomous+Agents

3. VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks — Jing Yu Koh, Shuyan Zhou, Daniel Fried, and collaborators, 2024

https://scholar.google.com/scholar?q=VisualWebArena%3A+Evaluating+Multimodal+Agents+on+Realistic+Visual+Web+Tasks

4. WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models — Hongliang He and collaborators, 2024

https://scholar.google.com/scholar?q=WebVoyager%3A+Building+an+End-to-End+Web+Agent+with+Large+Multimodal+Models

5. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments — Tianbao Xie, Danyang Zhang, and collaborators, 2024

https://scholar.google.com/scholar?q=OSWorld%3A+Benchmarking+Multimodal+Agents+for+Open-Ended+Tasks+in+Real+Computer+Environments

6. OSWorld — Xie et al., 2024

https://scholar.google.com/scholar?q=OSWorld

7. WebVoyager — He et al., 2024

https://scholar.google.com/scholar?q=WebVoyager

8. AssistantBench — Yoran et al., 2024

https://scholar.google.com/scholar?q=AssistantBench

9. Online-Mind2Web — Xue et al., 2025

https://scholar.google.com/scholar?q=Online-Mind2Web

10. Claw-Eval — Ye et al., 2026

https://scholar.google.com/scholar?q=Claw-Eval

11. TheAgentCompany — Xu et al., 2025

https://scholar.google.com/scholar?q=TheAgentCompany

12. REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites — approx. recent web-agent benchmark authors, 2025/2026

https://scholar.google.com/scholar?q=REAL%3A+Benchmarking+Autonomous+Agents+on+Deterministic+Simulations+of+Real+Websites

13. Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation — approx. recent web-agent/world-model authors, 2025/2026

https://scholar.google.com/scholar?q=Web+Agents+with+World+Models%3A+Learning+and+Leveraging+Environment+Dynamics+in+Web+Navigation

14. DynaWeb: Model-Based Reinforcement Learning of Web Agents — approx. recent model-based RL for web-agent authors, 2025/2026

https://scholar.google.com/scholar?q=DynaWeb%3A+Model-Based+Reinforcement+Learning+of+Web+Agents

15. Privacy Practices of Browser Agents — approx. recent security/privacy researchers, 2025/2026

https://scholar.google.com/scholar?q=Privacy+Practices+of+Browser+Agents

16. The Hidden Dangers of Browsing AI Agents — approx. recent browser-agent security authors, 2025/2026

https://scholar.google.com/scholar?q=The+Hidden+Dangers+of+Browsing+AI+Agents

17. Building Browser Agents: Architecture, Security, and Practical Solutions — approx. recent practitioner/research authors, 2025/2026

https://scholar.google.com/scholar?q=Building+Browser+Agents%3A+Architecture%2C+Security%2C+and+Practical+Solutions

18. Judge Reliability Harness: Stress Testing the Reliability of LLM Judges — approx. recent evaluation researchers, 2025/2026

https://scholar.google.com/scholar?q=Judge+Reliability+Harness%3A+Stress+Testing+the+Reliability+of+LLM+Judges

19. When AIs Judge AIs: The Rise of Agent-as-a-Judge Evaluation for LLMs — approx. recent survey/review authors, 2025/2026

https://scholar.google.com/scholar?q=When+AIs+Judge+AIs%3A+The+Rise+of+Agent-as-a-Judge+Evaluation+for+LLMs

20. AI Post Transformers: ASI-Evolve for Data, Architectures, and RL — Hal Turing & Dr. Ada Shannon, 2026

https://podcast.do-not-panic.com/episodes/2026-04-05-asi-evolve-for-data-architectures-and-rl-197b2b.mp3

21. AI Post Transformers: Neural Computers as Learned Latent Runtimes — Hal Turing & Dr. Ada Shannon, 2026

https://podcast.do-not-panic.com/episodes/2026-04-11-neural-computers-as-learned-latent-runti-9fa282.mp3

22. AI Post Transformers: MEMSEARCHER: Reinforcement Learning for LLM Memory Management — Hal Turing & Dr. Ada Shannon, 2026

https://podcast.do-not-panic.com/episodes/2026-04-04-memsearcher-reinforcement-learning-for-l-e9ad84.mp3

Interactive Visualization: ClawBench for Real-World Online AI Agents

By mcgrof