AI Post Transformers

ClawBench for Real-World Online AI Agents



This episode explores ClawBench, a benchmark that tests whether frontier AI agents can reliably complete everyday online tasks on live production websites rather than in simplified sandbox replicas. It explains why real-world web use is much harder than static benchmarks suggest, highlighting obstacles such as cookie banners, dynamic pages, login issues, anti-bot friction, and multi-step form filling. The benchmark spans 153 tasks on 144 websites in 15 categories, including travel, shopping, job applications, and office admin. The discussion argues that strong language models are not automatically strong agents, because closed-loop browser interaction demands error recovery, state tracking, and precise action selection in messy environments. The episode also examines the tradeoff between realism, safety, and reproducibility, including ClawBench’s submission-blocking safety layer and its agent-based evaluator for scoring complex live-web workflows.
Sources:
1. ClawBench: Can AI Agents Complete Everyday Online Tasks? — Yuxuan Zhang, Yubo Wang, Yipeng Zhu, Penghui Du, Junwen Miao, Xuan Lu, Wendong Xu, Yunzhuo Hao, Songcheng Cai, Xiaochen Wang, Huaisong Zhang, Xian Wu, Yi Lu, Minyi Lei, Kai Zou, Huifeng Yin, Ping Nie, Liang Chen, Dongfu Jiang, Wenhu Chen, Kelsey R. Allen, 2026
http://arxiv.org/abs/2604.08523
2. WebArena: A Realistic Web Environment for Building Autonomous Agents — Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Graham Neubig, and others, 2024
https://scholar.google.com/scholar?q=WebArena:+A+Realistic+Web+Environment+for+Building+Autonomous+Agents
3. VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks — Jing Yu Koh, Shuyan Zhou, and collaborators, 2024
https://scholar.google.com/scholar?q=VisualWebArena:+Evaluating+Multimodal+Agents+on+Realistic+Visual+Web+Tasks
4. WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models — Hongliang He and collaborators, 2024
https://scholar.google.com/scholar?q=WebVoyager:+Building+an+End-to-End+Web+Agent+with+Large+Multimodal+Models
5. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments — Tianbao Xie and collaborators, 2024
https://scholar.google.com/scholar?q=OSWorld:+Benchmarking+Multimodal+Agents+for+Open-Ended+Tasks+in+Real+Computer+Environments
6. OSWorld — Xie et al., 2024
https://scholar.google.com/scholar?q=OSWorld
7. WebVoyager — He et al., 2024
https://scholar.google.com/scholar?q=WebVoyager
8. AssistantBench — Yoran et al., 2024
https://scholar.google.com/scholar?q=AssistantBench
9. Online-Mind2Web — Xue et al., 2025
https://scholar.google.com/scholar?q=Online-Mind2Web
10. Claw-Eval — Ye et al., 2026
https://scholar.google.com/scholar?q=Claw-Eval
11. TheAgentCompany — Xu et al., 2025
https://scholar.google.com/scholar?q=TheAgentCompany
12. REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites — approx. recent web-agent benchmark authors, 2025/2026
https://scholar.google.com/scholar?q=REAL:+Benchmarking+Autonomous+Agents+on+Deterministic+Simulations+of+Real+Websites
13. Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation — approx. recent web-agent/world-model authors, 2025/2026
https://scholar.google.com/scholar?q=Web+Agents+with+World+Models:+Learning+and+Leveraging+Environment+Dynamics+in+Web+Navigation
14. DynaWeb: Model-Based Reinforcement Learning of Web Agents — approx. recent model-based RL for web-agent authors, 2025/2026
https://scholar.google.com/scholar?q=DynaWeb:+Model-Based+Reinforcement+Learning+of+Web+Agents
15. Privacy Practices of Browser Agents — approx. recent security/privacy researchers, 2025/2026
https://scholar.google.com/scholar?q=Privacy+Practices+of+Browser+Agents
16. The Hidden Dangers of Browsing AI Agents — approx. recent browser-agent security authors, 2025/2026
https://scholar.google.com/scholar?q=The+Hidden+Dangers+of+Browsing+AI+Agents
17. Building Browser Agents: Architecture, Security, and Practical Solutions — approx. recent practitioner/research authors, 2025/2026
https://scholar.google.com/scholar?q=Building+Browser+Agents:+Architecture,+Security,+and+Practical+Solutions
18. Judge Reliability Harness: Stress Testing the Reliability of LLM Judges — approx. recent evaluation researchers, 2025/2026
https://scholar.google.com/scholar?q=Judge+Reliability+Harness:+Stress+Testing+the+Reliability+of+LLM+Judges
19. When AIs Judge AIs: The Rise of Agent-as-a-Judge Evaluation for LLMs — approx. recent survey/review authors, 2025/2026
https://scholar.google.com/scholar?q=When+AIs+Judge+AIs:+The+Rise+of+Agent-as-a-Judge+Evaluation+for+LLMs
20. AI Post Transformers: ASI-Evolve for Data, Architectures, and RL — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-05-asi-evolve-for-data-architectures-and-rl-197b2b.mp3
21. AI Post Transformers: Neural Computers as Learned Latent Runtimes — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-11-neural-computers-as-learned-latent-runti-9fa282.mp3
22. AI Post Transformers: MEMSEARCHER: Reinforcement Learning for LLM Memory Management — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-04-memsearcher-reinforcement-learning-for-l-e9ad84.mp3
Interactive Visualization: ClawBench for Real-World Online AI Agents

AI Post Transformers, by mcgrof