Evaluation Command
Evaluation Strategies
stepwise
Evaluates using all screenshots, step by step.
final
Evaluates only the final screenshot, for efficiency.
Score Threshold
LexBench-Browser uses a 0-100 scoring system.
Evaluate Data Subset
Results
Results are saved in experiments/LexBench-Browser/<Agent>/<Timestamp>/tasks_eval_result/.
Each task result contains:
predicted_label: 1 = success, 0 = failure
score: 0-100
grader_response: Detailed evaluation response
failure_category: Failure category (if applicable)
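
The result fields can be aggregated with a short script. The following is a minimal sketch, not part of LexBench-Browser itself: it assumes each task result is stored as a JSON object in its own .json file somewhere under the results directory, which the text above does not confirm. The function name summarize_results is hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_results(result_dir):
    """Aggregate per-task results found under result_dir.

    Assumption: each task result is a JSON object in its own *.json file
    and carries the fields predicted_label, score, and failure_category.
    """
    records = []
    for path in sorted(Path(result_dir).rglob("*.json")):
        with path.open(encoding="utf-8") as f:
            data = json.load(f)
        # Skip JSON files that do not look like task results.
        if isinstance(data, dict) and "predicted_label" in data:
            records.append(data)

    total = len(records)
    successes = sum(1 for r in records if r["predicted_label"] == 1)
    scores = [r["score"] for r in records if isinstance(r.get("score"), (int, float))]
    # Count only failed tasks that actually report a category.
    categories = Counter(
        r["failure_category"]
        for r in records
        if r["predicted_label"] == 0 and r.get("failure_category")
    )

    return {
        "total": total,
        "successes": successes,
        "success_rate": successes / total if total else 0.0,
        "mean_score": sum(scores) / len(scores) if scores else None,
        "failure_categories": dict(categories),
    }
```

Calling summarize_results on the tasks_eval_result directory of a given run would return the task count, the success rate derived from predicted_label, the mean score, and a tally of failure categories. If the results are stored in a different layout, only the file-loading loop needs to change.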