## Directory structure
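A sketch of one plausible layout, assembled from the files described in the steps below (directory and file names other than those steps mention are assumptions):

```
benchmarks/
└── My-Benchmark/
    ├── tasks.json       # task definitions (Step 1)
    ├── data_info.json   # split/filter configuration (Step 2)
    └── evaluator.py     # optional custom evaluator (Step 3)
```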
Each benchmark lives in its own directory under `benchmarks/` and bundles its task data, split configuration, and optional evaluator.

## Step 1: Create task data
### `tasks.json` format

#### Required fields

| Field | Type | Description |
|---|---|---|
| `task_id` | string | Unique task ID |
| `task` | string | Task description |
#### Optional fields

| Field | Type | Description |
|---|---|---|
| `website` | string | Target website |
| `category` | string | Task category |
| `expected_result` | string | Expected outcome |
| `requires_login` | boolean | Whether login is required |
| `difficulty` | string | Difficulty level |
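Putting the fields together, a single entry might look like the following. The values are illustrative, and whether `tasks.json` is a top-level list or wraps entries in an object is an assumption; check an existing benchmark to confirm.

```json
[
  {
    "task_id": "example-001",
    "task": "Find the current top story on the site and report its headline.",
    "website": "https://example.com",
    "category": "information_retrieval",
    "expected_result": "The headline of the top story",
    "requires_login": false,
    "difficulty": "easy"
  }
]
```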
## Step 2: Create the data info file

### `data_info.json`

#### Split filter rules
- `null`: include all tasks
- `{"field": "value"}`: filter by field value
- `{"field": ["v1", "v2"]}`: filter by a list of values
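For illustration, a `data_info.json` might map split names to those filter rules. The `"splits"` key and the split names here are assumptions about the schema, not confirmed by this document:

```json
{
  "splits": {
    "all": null,
    "easy": {"difficulty": "easy"},
    "shopping_or_travel": {"category": ["shopping", "travel"]}
  }
}
```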
## Step 3: Create an evaluator (optional)

If you need custom evaluation logic, add an `evaluator.py` file.
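A minimal sketch of what such an evaluator could look like. The class name, the `evaluate(task, result)` signature, and the returned dict shape are all assumptions; check the framework's built-in evaluators for the real base class and contract.

```python
class MyBenchmarkEvaluator:
    """Sketch of a custom evaluator (hypothetical interface).

    Assumption: the framework calls ``evaluate(task, result)`` and
    expects a dict with ``success`` and ``score`` keys.
    """

    def evaluate(self, task: dict, result: str) -> dict:
        # Naive check: does the agent's output contain the optional
        # expected_result string (case-insensitive)?
        expected = (task.get("expected_result") or "").strip().lower()
        success = bool(expected) and expected in result.strip().lower()
        return {"success": success, "score": 1.0 if success else 0.0}
```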
## Step 4: Register the benchmark

Register the benchmark in `browseruse_bench/benchmarks/__init__.py`.
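One plausible registration pattern is a module-level registry dict; this is a sketch, and the actual names and structure in `browseruse_bench/benchmarks/__init__.py` may differ:

```python
# Hypothetical registry: maps a benchmark name to its directory.
BENCHMARK_REGISTRY: dict[str, str] = {}


def register_benchmark(name: str, path: str) -> None:
    """Register a benchmark directory under a lookup name."""
    BENCHMARK_REGISTRY[name] = path


# "my-benchmark" and its path are placeholders for your benchmark.
register_benchmark("my-benchmark", "benchmarks/My-Benchmark")
```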
## Step 5: Test
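The document does not prescribe a test procedure, but as a minimal smoke test (an assumption, not the project's official harness) you can check that every entry in your `tasks.json` has the required fields from Step 1, with the right types and unique IDs:

```python
# Required fields and their types, per the tasks.json tables above.
REQUIRED_FIELDS = {"task_id": str, "task": str}


def validate_tasks(tasks: list) -> list:
    """Return a list of human-readable problems; empty means valid."""
    problems = []
    seen_ids = set()
    for i, t in enumerate(tasks):
        for field, ftype in REQUIRED_FIELDS.items():
            if field not in t:
                problems.append(f"task {i}: missing required field '{field}'")
            elif not isinstance(t[field], ftype):
                problems.append(f"task {i}: '{field}' should be {ftype.__name__}")
        tid = t.get("task_id")
        if tid is not None and tid in seen_ids:
            problems.append(f"task {i}: duplicate task_id '{tid}'")
        seen_ids.add(tid)
    return problems
```

Run it against your file with `validate_tasks(json.load(open("benchmarks/My-Benchmark/tasks.json")))` (path is a placeholder) and fix anything it reports before registering.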
## Full examples

Reference existing benchmarks:

- `benchmarks/LexBench-Browser/` - full benchmark implementation
- `benchmarks/Online-Mind2Web/` - Mind2Web integration example
- `benchmarks/BrowseComp/` - simple benchmark example