Supported Benchmarks
LexBench-Browser
Recommended. An evaluation benchmark designed for Chinese websites, with 340 tasks covering 50+ mainstream sites. It supports a no-login subset for automated evaluation.
Online-Mind2Web
Online evaluation based on the Mind2Web dataset, testing agents’ navigation and interaction capabilities on real websites.
BrowseComp
Competition-style browser operation tasks that evaluate agents’ overall browser operation capabilities.
Feature Comparison
| Benchmark | Tasks | Language | Evaluation | Login Required |
|---|---|---|---|---|
| LexBench-Browser | 340 | Chinese | WebJudge | Partial |
| Online-Mind2Web | ~100 | English | WebJudge | No |
| BrowseComp | ~50 | English | Grader | No |
Quick Comparison Run
Data Location
All benchmark data is stored in the benchmarks/ directory:
| Benchmark | Data File Path |
|---|---|
| LexBench-Browser | benchmarks/LexBench-Browser/data/tasks.json |
| Online-Mind2Web | benchmarks/Online-Mind2Web/data/Online_Mind2Web.json |
| BrowseComp | benchmarks/BrowseComp/data/tasks.json |
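A quick way to confirm that the data files listed above are present is to load each one and count its top-level entries. The following is a minimal sketch, not part of the project's tooling: the helper names `count_entries` and `summarize` are hypothetical, and it assumes each file holds a JSON array or object at the top level, since the task schema is not described here. The entry count may therefore differ from the task counts in the comparison table.

```python
import json
from pathlib import Path

# Paths copied from the table above, relative to the repository root.
BENCHMARK_DATA_FILES = {
    "LexBench-Browser": "benchmarks/LexBench-Browser/data/tasks.json",
    "Online-Mind2Web": "benchmarks/Online-Mind2Web/data/Online_Mind2Web.json",
    "BrowseComp": "benchmarks/BrowseComp/data/tasks.json",
}


def count_entries(path):
    """Return the number of top-level entries in a JSON file.

    Returns None if the file does not exist.
    Assumption: the file holds a JSON array or object at the top level;
    any other JSON value is counted as a single entry.
    """
    path = Path(path)
    if not path.is_file():
        return None
    with path.open(encoding="utf-8") as f:
        data = json.load(f)
    return len(data) if isinstance(data, (list, dict)) else 1


def summarize(repo_root="."):
    """Map each benchmark name to its entry count (None when the file is absent)."""
    root = Path(repo_root)
    return {
        name: count_entries(root / rel_path)
        for name, rel_path in BENCHMARK_DATA_FILES.items()
    }


if __name__ == "__main__":
    for name, count in summarize().items():
        status = count if count is not None else "data file not found"
        print(f"{name}: {status}")
```

Run the script from the repository root, or pass a different root to `summarize()` if the benchmarks/ directory lives elsewhere.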
Planned Support
- More benchmarks
If you’d like to add a new benchmark, please refer to the Custom Benchmark guide.