Agent Benchmarks
Standardized measures of AI agent capability — from software engineering to web navigation.
Agent Benchmark Landscape

- Code & software: SWE-bench, HumanEval, MBPP
- Web & navigation: WebArena, MiniWoB++, Mind2Web
- General assistant: GAIA, AgentBench, AssistantBench
- Tool & function calling: τ-bench, API-Bank, ToolBench

Across these categories, human baselines sit roughly at 78-95%, while current state-of-the-art agents score roughly 40-72%.
Current State of the Art (2025)
| Benchmark | Top Score | Leading System | Human Baseline |
|---|---|---|---|
| SWE-bench Verified | 72% | Claude 3.5 + Devin | ~100% |
| GAIA (L1) | 75% | Various | 92% |
| WebArena | 42% | GPT-4V + SoM | 78% |
| HumanEval | 95%+ | Various | ~100% |
Software Engineering
SWE-bench Verified: the gold standard for code agents. 500 human-verified GitHub issues drawn from 12 Python repositories; the agent must generate a patch that makes the repository's test suite pass.
SWE-bench (full): the complete dataset of 2,294 real GitHub issues. More diverse, but noisier, than the Verified subset.
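The SWE-bench harness decides pass/fail by applying the candidate patch and re-running the issue's tests in an isolated environment. The sketch below illustrates that check under simplifying assumptions: the repository is already checked out at the issue's base commit, and the `apply_and_test` helper and direct pytest invocation stand in for the official Docker-based harness.

```python
import subprocess
from pathlib import Path

def apply_and_test(repo_dir: str, patch: str, fail_to_pass: list[str]) -> bool:
    """Illustrative SWE-bench-style check (not the official harness):
    apply a model-generated patch, then run the tests the gold patch
    is known to fix (the FAIL_TO_PASS set)."""
    repo = Path(repo_dir)

    # A patch that does not apply cleanly counts as a failed attempt.
    patch_file = repo / "candidate.patch"
    patch_file.write_text(patch)
    applied = subprocess.run(
        ["git", "apply", "--whitespace=nowarn", str(patch_file)],
        cwd=repo, capture_output=True,
    )
    if applied.returncode != 0:
        return False

    # Run only the targeted tests; the real harness does this inside a
    # container with pinned dependencies to keep results reproducible.
    result = subprocess.run(
        ["python", "-m", "pytest", "-x", *fail_to_pass],
        cwd=repo, capture_output=True, timeout=900,
    )
    return result.returncode == 0
```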
Web Navigation
WebArena: web navigation benchmark with 812 tasks across 5 self-hosted websites (shopping, Reddit, GitLab, OpenStreetMap, Wikipedia), carried out in a functioning browser environment.
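Conceptually, a WebArena episode is an observe/act loop over a real browser: the agent receives a page observation, emits an action (click, type, stop), and is scored on the final state or returned answer. The sketch below uses Playwright with a stubbed `choose_action` policy to show the loop's shape; it is not the official harness, which provides accessibility-tree observations and a richer action space.

```python
from playwright.sync_api import sync_playwright

def choose_action(intent: str, observation: str) -> dict:
    """Hypothetical policy stub: a real agent would prompt an LLM with the
    task intent and the observation, and parse the returned action."""
    return {"type": "stop", "answer": ""}

def run_episode(intent: str, start_url: str, max_steps: int = 15) -> str:
    """Minimal WebArena-style episode loop (illustrative only)."""
    answer = ""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(start_url)
        for _ in range(max_steps):
            observation = page.inner_text("body")[:4000]  # crude text observation
            action = choose_action(intent, observation)
            if action["type"] == "stop":
                answer = action.get("answer", "")
                break
            if action["type"] == "click":
                page.click(action["selector"])
            elif action["type"] == "type":
                page.fill(action["selector"], action["text"])
        browser.close()
    return answer
```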
General Assistant
GAIA: the General AI Assistant benchmark. 466 questions requiring web search, file processing, and multi-step reasoning, each with a single short gold answer.
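GAIA answers are short strings, numbers, or comma-separated lists, and systems are scored by quasi-exact match after light normalization. The scorer below is a rough approximation of that idea, assuming simplified normalization rules; it is not the official implementation.

```python
import re

def normalize(text: str) -> str:
    # Lowercase, drop articles and most punctuation, collapse whitespace.
    text = text.strip().lower()
    text = re.sub(r"\b(the|a|an)\b", " ", text)
    text = re.sub(r"[^\w\s.-]", "", text)
    return re.sub(r"\s+", " ", text).strip()

def quasi_exact_match(prediction: str, gold: str) -> bool:
    """Approximate GAIA-style scoring: numbers compared as floats,
    comma-separated lists element-wise, everything else as normalized strings."""
    try:
        return float(prediction.replace(",", "")) == float(gold.replace(",", ""))
    except ValueError:
        pass
    if "," in gold:
        preds = [normalize(p) for p in prediction.split(",")]
        golds = [normalize(g) for g in gold.split(",")]
        return preds == golds
    return normalize(prediction) == normalize(gold)
```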
Tool + Conversation
τ-bench: multi-turn conversations with a simulated user that require tool calls against simulated APIs in airline and retail customer-service domains.
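Each τ-bench-style domain pairs tool schemas with a deterministic simulated backend, and success depends chiefly on whether the final database state matches the annotated goal rather than on the dialogue text. The tool definition and toy state below (`cancel_reservation`, its fields, the RESERVATIONS dict) are invented for illustration; they show the shape of such an API, not the benchmark's actual tools.

```python
# Hypothetical tool schema in the common JSON-Schema function-calling format.
CANCEL_TOOL = {
    "name": "cancel_reservation",
    "description": "Cancel a reservation if the fare rules allow it.",
    "parameters": {
        "type": "object",
        "properties": {
            "reservation_id": {"type": "string"},
            "reason": {"type": "string", "enum": ["change_of_plans", "airline_delay"]},
        },
        "required": ["reservation_id"],
    },
}

# Toy in-memory database standing in for the simulated backend state.
RESERVATIONS = {"R123": {"status": "confirmed", "refundable": True}}

def cancel_reservation(reservation_id: str, reason: str = "change_of_plans") -> dict:
    """Deterministic simulated API call; the episode is graded afterwards by
    comparing RESERVATIONS against the goal state annotated for the task."""
    booking = RESERVATIONS.get(reservation_id)
    if booking is None:
        return {"error": "reservation not found"}
    if not booking["refundable"]:
        return {"error": "fare rules do not allow cancellation"}
    booking["status"] = "cancelled"
    return {"status": "cancelled", "reservation_id": reservation_id}
```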
Multi-Environment
AgentBench: multi-dimensional evaluation across 8 environments: operating system, database, knowledge graph, digital card game, lateral thinking puzzles, house-holding, web shopping, and web browsing.
Code Generation
HumanEval (164 hand-written Python problems) and MBPP (crowd-sourced entry-level Python tasks) check generated functions against unit tests and report pass@k, the probability that at least one of k sampled completions passes.
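pass@k is normally computed with the unbiased estimator introduced alongside HumanEval: sample n ≥ k completions per problem, count the c that pass the tests, and estimate the probability that a random size-k subset contains at least one passing sample.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples on one problem, 57 of which pass -> estimate pass@10.
print(round(pass_at_k(n=200, c=57, k=10), 3))
```

Per-problem estimates are then averaged over the benchmark to give the reported score.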
Evaluation Methodology
Rigorous benchmark evaluation requires deterministic execution (temperature=0), isolated environments, and statistical confidence intervals. A common error is reporting accuracy without confidence bounds — on small benchmarks like HumanEval (164 tasks), a 2-point difference may not be statistically significant.
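As a concrete illustration, a Wilson score interval makes that uncertainty visible; the two pass counts below are made-up numbers sized to a 164-task benchmark, and their two-point gap leaves the intervals heavily overlapping.

```python
from math import sqrt

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a benchmark pass rate."""
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return (center - half, center + half)

# Two hypothetical runs on a 164-task benchmark, two points apart:
print(wilson_interval(successes=148, trials=164))  # ~90.2% -> roughly (0.85, 0.94)
print(wilson_interval(successes=151, trials=164))  # ~92.1% -> roughly (0.87, 0.95)
```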