Agent Benchmarks
Standardized measures of AI agent capability — from software engineering to web navigation.
┌─────────────────────────────────────────────────────────────────────────────┐
│                          Agent Benchmark Landscape                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  CODE & SOFTWARE             WEB & NAVIGATION                               │
│  ┌────────────────────┐      ┌────────────────────┐                         │
│  │ SWE-bench Verified │      │ WebArena           │                         │
│  │ SWE-bench Pro      │      │ WebArena-Verified  │                         │
│  │ Terminal-Bench Hard│      │ WebChoreArena      │                         │
│  │ LiveCodeBench      │      │                    │                         │
│  └────────────────────┘      └────────────────────┘                         │
│                                                                             │
│  GENERAL & WORKPLACE         TOOL & CONVERSATION                            │
│  ┌────────────────────┐      ┌────────────────────┐                         │
│  │ GAIA               │      │ τ²-bench           │                         │
│  │ GDPval-AA          │      │ AssistantBench     │                         │
│  │ TheAgentCompany    │      │ ToolBench          │                         │
│  └────────────────────┘      └────────────────────┘                         │
│                                                                             │
│  Human Baseline  ████████████████████████████████████████████  78-95%      │
│  Current SOTA    ██████████████████████████████████            30-92%      │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
Current State of the Art (2026)
| Benchmark | Top Score | Leading System | Human Baseline |
|---|---|---|---|
| SWE-bench Verified | 80.9% | Claude Opus 4.5 / 4.6 | ~100% |
| SWE-bench Pro | 57.7% | GPT-5.4 | ~100% |
| GDPval-AA | 1667 Elo | GPT-5.4 | — |
| Terminal-Bench Hard | 57.6% | GPT-5.4 | — |
| WebArena | ~62% | IBM CUGA | 78% |
| TheAgentCompany | ~30% | Various | ~90% |
Software Engineering
SWE-bench Verified
Human-verified subset of 500 GitHub issues from 12 Python repositories. The agent must generate patches that pass the repositories' test suites. Widely cited but facing contamination concerns; see note below.
SWE-bench Pro
Successor to SWE-bench Verified with 1,865 uncontaminated tasks across Python, Go, TypeScript, and JavaScript in 41 actively maintained repositories. Tasks require fixing bugs or implementing features without breaking existing tests.
Terminal-Bench Hard
Agentic benchmark from Stanford and the Laude Institute evaluating AI agents in terminal environments. Tasks include compiling code, training models, configuring servers, and debugging systems inside Docker containers.
LiveCodeBench
Contamination-free coding benchmark that continuously harvests fresh competitive programming problems from LeetCode, AtCoder, and Codeforces. Tests code generation, self-repair, and execution prediction.
Web Navigation
WebArena
Web navigation benchmark with 812 tasks across 5 self-hosted websites (shopping, Reddit, GitLab, OpenStreetMap, Wikipedia). Agents must complete realistic multi-step web tasks.
General Assistant
GAIA
General AI Assistant benchmark of 466 questions requiring web search, file processing, and multi-step reasoning across 3 difficulty levels.
GDPval-AA
Real-world agentic work tasks across 44 occupations and 9 industries. Models get shell access and web browsing to produce documents, slides, diagrams, and spreadsheets. Scored via blind pairwise comparisons aggregated into Elo ratings.
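Pairwise scoring of this kind reduces to the standard Elo update rule. A minimal sketch in Python; the K-factor, 1500 starting rating, and match outcomes below are illustrative assumptions, not GDPval-AA's actual grading parameters:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return updated (r_a, r_b) after one blind pairwise comparison."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    delta = k * (s_a - e_a)
    return r_a + delta, r_b - delta

# Two systems start at 1500; system A wins three comparisons in a row.
# Each win moves less than the last, since A becomes the expected winner.
ra, rb = 1500.0, 1500.0
for _ in range(3):
    ra, rb = elo_update(ra, rb, a_won=True)
```

Because each update is zero-sum, the rating pool stays constant; a system's final rating reflects only how often it wins head-to-head judgments, not any absolute task score.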
Tool + Conversation
τ²-bench
Dual-control conversational AI benchmark from Sierra Research simulating technical support scenarios. Both agent and user can modify shared state, testing problem-solving and communication in the telecom domain.
Workplace Simulation
TheAgentCompany
Workplace simulation from CMU with 175 tasks set in a simulated software company. Agents browse the web, write code, run programs, and communicate with simulated coworkers through a Slack-style chat service, GitLab, and other office tools.
Evaluation Methodology
Rigorous benchmark evaluation requires deterministic execution (temperature=0), isolated environments, and statistical confidence intervals. A common error is reporting accuracy without confidence bounds — on small benchmarks like TheAgentCompany (175 tasks), a 2-point difference may not be statistically significant.
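To make that last point concrete, a Wilson score interval puts error bars on a benchmark accuracy using only the standard library. A minimal sketch; the success counts of 52 and 56 are illustrative, chosen to give roughly 30% and 32% on 175 tasks:

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """95% Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Hypothetical agents at ~29.7% (52/175) and ~32.0% (56/175):
lo_a, hi_a = wilson_interval(52, 175)
lo_b, hi_b = wilson_interval(56, 175)
```

Both intervals span roughly ±7 points, so they overlap almost entirely; a 2-point gap on 175 tasks could easily be noise rather than a real capability difference.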