This page describes how to evaluate agent results on the BrowseComp benchmark.

Evaluation Command

uv run scripts/eval.py \
  --agent <agent_name> \
  --benchmark BrowseComp \
  [options]

Basic Evaluation

uv run scripts/eval.py \
  --agent browser-use \
  --benchmark BrowseComp

Force Re-evaluation

uv run scripts/eval.py \
  --agent browser-use \
  --benchmark BrowseComp \
  --force-reeval

Results

Results are saved in experiments/BrowseComp/<Agent>/<Timestamp>/tasks_eval_result/. BrowseComp uses a Grader for evaluation, and each task's result contains:
  • predicted_label: 1 = success, 0 = failure
  • grader_response: Evaluation details
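
To compute an aggregate success rate across tasks, a minimal sketch along these lines can work. It assumes each file in tasks_eval_result/ is a JSON object with a predicted_label field; the directory layout and field names here are assumptions, so adjust them to match your actual output.

import json
from pathlib import Path

# Replace <Timestamp> with the actual run directory name.
result_dir = Path("experiments/BrowseComp/browser-use/<Timestamp>/tasks_eval_result")

labels = []
for path in sorted(result_dir.glob("*.json")):
    with path.open() as f:
        result = json.load(f)
    labels.append(result["predicted_label"])  # 1 = success, 0 = failure

if labels:
    print(f"Tasks evaluated: {len(labels)}")
    print(f"Success rate: {sum(labels) / len(labels):.2%}")
else:
    print("No result files found.")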