This page explains how to integrate a new benchmark into browseruse-bench.

Directory structure

Each benchmark should follow this layout:
benchmarks/
`-- <YourBenchmark>/
    |-- README.md           # Benchmark documentation
    |-- data/
    |   `-- tasks.json      # Task data
    |-- data_info.json      # Version and split metadata
    `-- evaluator.py        # Optional evaluator
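
If you prefer to scaffold this layout from a script, a minimal sketch follows; the helper below and its placeholder file contents are illustrative only, not part of browseruse-bench:

from pathlib import Path
import json

def scaffold_benchmark(name: str, root: str = "benchmarks") -> None:
    """Create the skeleton shown above with empty placeholder files (illustrative helper)."""
    bench_dir = Path(root) / name
    (bench_dir / "data").mkdir(parents=True, exist_ok=True)
    (bench_dir / "README.md").write_text(f"# {name}\n")
    (bench_dir / "data" / "tasks.json").write_text("[]\n")
    (bench_dir / "data_info.json").write_text(json.dumps({"name": name}, indent=2) + "\n")

scaffold_benchmark("YourBenchmark")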

Step 1: Create task data

tasks.json format

[
  {
    "task_id": "your_task_001",
    "task": "Task description shown to the agent",
    "website": "example.com",
    "category": "information_retrieval",
    "expected_result": "Expected result description",
    "requires_login": false,
    "difficulty": "easy"
  },
  {
    "task_id": "your_task_002",
    "task": "Another task description",
    "website": "example.org",
    "category": "form_filling",
    "expected_result": "Form submission successful",
    "requires_login": true,
    "difficulty": "medium"
  }
]

Required fields

Field      Type      Description
task_id    string    Unique task ID
task       string    Task description

Optional fields

Field             Type      Description
website           string    Target website
category          string    Task category
expected_result   string    Expected outcome
requires_login    boolean   Whether login is required
difficulty        string    Difficulty level
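
Before wiring the benchmark up, it can help to sanity-check tasks.json against these fields. The sketch below assumes the schema exactly as tabulated above; the type checks on optional fields are an interpretation, not something enforced by browseruse-bench itself:

import json

REQUIRED_FIELDS = {"task_id": str, "task": str}
OPTIONAL_FIELDS = {
    "website": str,
    "category": str,
    "expected_result": str,
    "requires_login": bool,
    "difficulty": str,
}

def validate_tasks(path: str) -> None:
    """Fail fast if any task is missing a required field or uses an unexpected type."""
    with open(path) as f:
        tasks = json.load(f)
    seen_ids = set()
    for i, task in enumerate(tasks):
        for field, expected_type in REQUIRED_FIELDS.items():
            if not isinstance(task.get(field), expected_type):
                raise ValueError(f"task {i}: missing or invalid required field '{field}'")
        for field, expected_type in OPTIONAL_FIELDS.items():
            if field in task and not isinstance(task[field], expected_type):
                raise ValueError(f"task {i}: field '{field}' should be {expected_type.__name__}")
        if task["task_id"] in seen_ids:
            raise ValueError(f"task {i}: duplicate task_id '{task['task_id']}'")
        seen_ids.add(task["task_id"])

validate_tasks("benchmarks/YourBenchmark/data/tasks.json")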

Step 2: Create the data info file

data_info.json

{
  "name": "YourBenchmark",
  "description": "Your benchmark description",
  "default_version": "20251231",
  "versions": {
    "20251231": {
      "data_file": "data/tasks.json",
      "splits": {
        "All": null,
        "Easy": {"difficulty": "easy"},
        "Medium": {"difficulty": "medium"},
        "Hard": {"difficulty": "hard"}
      }
    }
  }
}

Split filter rules

  • null: include all tasks
  • {"field": "value"}: filter by field value
  • {"field": ["v1", "v2"]}: filter by a list of values
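
The sketch below shows how these three filter forms can be interpreted; it is an assumption about the filtering semantics, not the project's actual loader code:

import json

def apply_split(tasks, split_filter):
    """Return the tasks matching a split filter as described above (illustrative)."""
    if split_filter is None:
        # null: include all tasks
        return tasks
    def matches(task):
        for field, value in split_filter.items():
            allowed = value if isinstance(value, list) else [value]
            if task.get(field) not in allowed:
                return False
        return True
    return [t for t in tasks if matches(t)]

with open("benchmarks/YourBenchmark/data/tasks.json") as f:
    tasks = json.load(f)

easy_tasks = apply_split(tasks, {"difficulty": "easy"})                   # the "Easy" split
easy_or_medium = apply_split(tasks, {"difficulty": ["easy", "medium"]})   # list-of-values form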

Step 3: Create an evaluator (optional)

If you need custom evaluation logic, add evaluator.py:
from browseruse_bench.eval.base_evaluator import BaseEvaluator

class YourBenchmarkEvaluator(BaseEvaluator):
    """Custom evaluator."""

    def evaluate_task(self, task_id: str, result: dict) -> dict:
        """
        Evaluate a single task.

        Args:
            task_id: Task ID
            result: Agent result

        Returns:
            Evaluation result containing predicted_label and score.
        """
        score = self._calculate_score(result)

        return {
            "predicted_label": 1 if score >= 60 else 0,
            "score": score,
            "evaluation_details": {
                "grader_response": "Evaluation details..."
            }
        }
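
The _calculate_score helper above is not shown because it is benchmark-specific. As one possible sketch, it could compare the agent's answer to the task's expected result; the final_answer and expected_result keys below are assumptions about what your runner puts into result, not part of BaseEvaluator:

    def _calculate_score(self, result: dict) -> float:
        """Illustrative scoring for YourBenchmarkEvaluator: string match against the expected result."""
        answer = str(result.get("final_answer", "")).strip().lower()
        expected = str(result.get("expected_result", "")).strip().lower()
        if not expected:
            return 0.0      # nothing to grade against
        if answer == expected:
            return 100.0    # exact match
        if expected in answer:
            return 80.0     # expected answer contained in a longer response
        return 0.0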

Step 4: Register the benchmark

Register in browseruse_bench/benchmarks/__init__.py:
BENCHMARKS = {
    "LexBench-Browser": ...,
    "Online-Mind2Web": ...,
    "BrowseComp": ...,
    "YourBenchmark": {
        "data_dir": "benchmarks/YourBenchmark",
        "evaluator": "browseruse_bench.eval.your_evaluator:YourBenchmarkEvaluator"
    }
}
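
The evaluator value is a module.path:ClassName string. A common way such strings are resolved is with importlib, as in the sketch below; this illustrates the pattern and is an assumption about how browseruse-bench loads the class, not its actual code:

import importlib

def load_evaluator(spec: str):
    """Resolve a 'module.path:ClassName' string to the class it names (illustrative)."""
    module_path, class_name = spec.split(":", 1)
    module = importlib.import_module(module_path)
    return getattr(module, class_name)

evaluator_cls = load_evaluator(
    "browseruse_bench.eval.your_evaluator:YourBenchmarkEvaluator"
)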

Step 5: Test

# Test run
uv run scripts/run.py \
  --agent browser-use \
  --benchmark YourBenchmark \
  --mode first_n --count 3

# Test evaluation
uv run scripts/eval.py \
  --agent browser-use \
  --benchmark YourBenchmark

Full examples

See the existing benchmarks for reference:
  • benchmarks/LexBench-Browser/ - Full benchmark implementation
  • benchmarks/Online-Mind2Web/ - Mind2Web integration example
  • benchmarks/BrowseComp/ - Simple benchmark example