Claw-Eval: Toward Trustworthy and Transparent Evaluation of Autonomous Agents
A benchmark of 300 tasks with 2,159 rubric items. It uses trajectory-aware grading and three-trial Pass^3 scoring to reduce the effect of lucky runs, and it evaluates agent reliability in real-world robotics settings.
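The summary does not spell out how Pass^3 is computed. A minimal sketch follows, assuming the common Pass^k convention in which a task counts as solved only when all k independent trials succeed. The function name `pass_hat_k`, the task ids, and the outcome data are hypothetical illustrations, not taken from the benchmark.

```python
from typing import Dict, List


def pass_hat_k(trials: Dict[str, List[bool]], k: int = 3) -> float:
    """Return the fraction of tasks whose first k trials all passed.

    Assumption: Pass^k credits a task only if every one of its k trials
    succeeds, so a single lucky run cannot earn credit on its own.
    """
    if not trials:
        raise ValueError("at least one task is required")
    solved = 0
    for task_id, outcomes in trials.items():
        if len(outcomes) < k:
            raise ValueError(f"task {task_id!r} has fewer than {k} trials")
        # Credit the task only when all k trials passed.
        if all(outcomes[:k]):
            solved += 1
    return solved / len(trials)


# Hypothetical outcomes for four tasks, three trials each.
example = {
    "task_a": [True, True, True],     # consistently solved
    "task_b": [True, False, True],    # solved twice, not reliable
    "task_c": [False, False, False],  # never solved
    "task_d": [True, True, True],     # consistently solved
}
score = pass_hat_k(example, k=3)
```

Under this assumption, `task_b` earns no credit even though it passed two of its three trials, which is how the all-trials requirement penalizes inconsistent agents.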