
Agent Benchmarks

Standardized measures of AI agent capability — from software engineering to web navigation.


┌─────────────────────────────────────────────────────────────────────────────┐
│                        Agent Benchmark Landscape                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  CODE & SOFTWARE                    WEB & NAVIGATION                        │
│  ┌────────────────────┐             ┌────────────────────┐                  │
│  │ SWE-bench          │             │ WebArena           │                  │
│  │ HumanEval          │             │ MiniWoB++          │                  │
│  │ MBPP               │             │ Mind2Web           │                  │
│  └────────────────────┘             └────────────────────┘                  │
│                                                                             │
│  GENERAL ASSISTANT                  TOOL & FUNCTION                         │
│  ┌────────────────────┐             ┌────────────────────┐                  │
│  │ GAIA               │             │ τ-bench            │                  │
│  │ AgentBench         │             │ API-Bank           │                  │
│  │ AssistantBench     │             │ ToolBench          │                  │
│  └────────────────────┘             └────────────────────┘                  │
│                                                                             │
│  Human Baseline ████████████████████████████████████████████ 78-95%        │
│  Current SOTA   ██████████████████████                       40-72%        │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Current State of the Art (2025)

Human vs AI Gap
The gap between human performance and AI agent performance remains significant on most benchmarks, especially those requiring long-horizon planning, error recovery, and real-world interaction.

Current state of the art vs human baseline:

Benchmark            Top Score   Leading System              Human Baseline
SWE-bench Verified   72%         Claude 3.5 Sonnet + Devin   ~100%
GAIA (L1)            75%         Various                     92%
WebArena             42%         GPT-4V + SoM                78%
HumanEval            95%+        Various                     ~100%

SWE-bench Verified
Top score: 72%

Gold standard for code agents. 500 human-verified GitHub issues from 12 Python repositories. The agent must generate patches that pass the repository's test suites.

500 tasks · Best: Claude 3.5 Sonnet + Devin

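The core check behind these scores is simple in outline: apply the agent's patch to the repository at the issue's base commit, then re-run the tests that originally failed. The sketch below is illustrative only, not the official harness (which evaluates each instance inside a dedicated Docker image with pinned dependencies); the function name, arguments, and the plain git-apply-plus-pytest flow are assumptions.

```python
import subprocess

def evaluate_patch(repo_dir: str, patch: str, fail_to_pass: list[str]) -> bool:
    """Apply a model-generated patch and re-run the issue's failing tests.

    Simplified stand-in for the SWE-bench harness, which runs each instance
    in an isolated container with a per-repository environment.
    """
    # Apply the unified diff produced by the agent (read from stdin).
    applied = subprocess.run(
        ["git", "apply", "--whitespace=nowarn", "-"],
        input=patch, text=True, cwd=repo_dir,
    )
    if applied.returncode != 0:
        return False  # patch does not apply cleanly -> instance unresolved

    # The instance counts as resolved only if every previously failing test now passes.
    tests = subprocess.run(["python", "-m", "pytest", "-x", *fail_to_pass], cwd=repo_dir)
    return tests.returncode == 0
```
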
SWE-bench (Full)
Top score: 51%

Complete dataset of 2,294 real GitHub issues. More diverse but noisier than the Verified subset.

2,294 tasks · Best: Various

WebArena
Top score: 42%

Web navigation benchmark with 812 tasks across five self-hosted websites (shopping, Reddit, GitLab, OpenStreetMap, Wikipedia).

812 tasks · Best: GPT-4V + SoM

GAIA (Medium-Hard)
Top score: 75% (L1)

General AI Assistant benchmark. 466 questions requiring web search, file processing, and multi-step reasoning.

466 tasks · Best: Various

τ-bench
Top score: 45% (Retail)

Multi-turn conversations requiring tool use across airline, retail, and banking domains with simulated APIs.

680 tasks · Best: GPT-4

AgentBench
Top score: model-dependent

Multi-dimensional evaluation across 8 environments: Operating System, Database, Knowledge Graph, Digital Card Game, Lateral Thinking Puzzles, House-Holding, Web Shopping, and Web Browsing.

1,632 tasks · Best: GPT-4

HumanEval
Top score: 95%+

Code generation benchmark. 164 Python programming problems testing functional correctness.

164 tasks · Best: Various

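HumanEval results are conventionally reported as pass@k, using the unbiased combinatorial estimator introduced alongside the benchmark: generate n samples per problem, count the c that pass the tests, and estimate the chance that at least one of k samples would pass. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples,
    drawn without replacement from n generated samples (c of which pass), is correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example with made-up counts: 37 of 200 samples pass -> pass@1 = 1 - 163/200 = 0.185
print(pass_at_k(n=200, c=37, k=1))
```
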
MBPP (Easy-Medium)
Top score: 85%+

Mostly Basic Python Problems. 974 entry-level Python programming tasks.

974 tasks · Best: Various

Evaluation Methodology

Rigorous benchmark evaluation requires deterministic execution (temperature=0), isolated environments, and statistical confidence intervals. A common error is reporting accuracy without confidence bounds — on small benchmarks like HumanEval (164 tasks), a 2-point difference may not be statistically significant.
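
For instance, an unpaired two-proportion z-test shows why a roughly 2-point gap on 164 tasks is inconclusive. The counts below are hypothetical, and when two systems are run on the same task set a paired test (e.g. McNemar's) is the stricter choice; this is only a quick normal-approximation check.

```python
from math import sqrt, erf

def two_proportion_ztest(k1: int, k2: int, n: int) -> tuple[float, float]:
    """Two-sided z-test for the gap between two pass counts on an
    n-task benchmark (unpaired, pooled normal approximation)."""
    p1, p2 = k1 / n, k2 / n
    pooled = (k1 + k2) / (2 * n)               # pooled pass rate under H0
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    z = (p1 - p2) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided tail
    return z, p_value

# Hypothetical scores: 140/164 (85.4%) vs 136/164 (82.9%), a ~2.4-point gap.
z, p = two_proportion_ztest(140, 136, 164)
print(f"z = {z:.2f}, p = {p:.2f}")             # about z = 0.60, p = 0.55 -> not significant
```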

Wilson Score Intervals
Always report 95% confidence intervals alongside benchmark scores. The Wilson score interval is preferred over the normal (Wald) approximation for small samples and for proportions near 0 or 1.
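
A minimal implementation of the interval, with a hypothetical 139/164 score to show how wide the bounds are at HumanEval's scale:

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score confidence interval for a benchmark pass rate
    (z = 1.96 gives a 95% interval)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# Hypothetical score of 139/164 (84.8%):
lo, hi = wilson_interval(139, 164)
print(f"95% CI: [{lo:.3f}, {hi:.3f}]")   # roughly [0.785, 0.895], about 11 points wide
```

An interval that wide is exactly why small score differences between systems should not be read as rankings without more samples or a paired analysis.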