danielhuber.dev@proton.me Wednesday, April 8, 2026

Agent Benchmarks

Standardized measures of AI agent capability — from software engineering to web navigation.


┌─────────────────────────────────────────────────────────────────────────────┐
│                        Agent Benchmark Landscape                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  CODE & SOFTWARE                    WEB & NAVIGATION                        │
│  ┌────────────────────┐             ┌────────────────────┐                  │
│  │ SWE-bench Verified │             │ WebArena           │                  │
│  │ SWE-bench Pro      │             │ WebArena-Verified  │                  │
│  │ Terminal-Bench Hard│             │ WebChoreArena      │                  │
│  │ LiveCodeBench      │             │                    │                  │
│  └────────────────────┘             └────────────────────┘                  │
│                                                                             │
│  GENERAL & WORKPLACE                TOOL & CONVERSATION                     │
│  ┌────────────────────┐             ┌────────────────────┐                  │
│  │ GAIA               │             │ τ²-bench           │                  │
│  │ GDPval-AA          │             │ AssistantBench     │                  │
│  │ TheAgentCompany    │             │ ToolBench          │                  │
│  └────────────────────┘             └────────────────────┘                  │
│                                                                             │
│  Human Baseline ████████████████████████████████████████████ 78-95%        │
│  Current SOTA   ██████████████████████████████████            30-92%       │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Current State of the Art (2026)

Human vs AI Gap
Agent performance varies dramatically by task type. Conversational tool use (τ²-bench) has surpassed human-level performance, while open-ended workplace tasks (TheAgentCompany) and uncontaminated software engineering (SWE-bench Pro) remain well below it.
Current state of the art vs human baseline:

Benchmark             Top Score   Leading System           Human Baseline
SWE-bench Verified    80.9%       Claude Opus 4.5 / 4.6    ~100%
SWE-bench Pro         57.7%       GPT-5.4                  ~100%
GDPval-AA             1667 ELO    GPT-5.4                  n/a
Terminal-Bench Hard   57.6%       GPT-5.4                  n/a
WebArena              ~62%        IBM CUGA                 78%
TheAgentCompany       ~30%        Various                  ~90%
SWE-bench Verified
80.9%
top score

Human-verified subset of 500 GitHub issues from 12 Python repositories. Agent must generate patches that pass test suites. Widely cited but facing contamination concerns — see note below.

500 tasks · Best: Claude Opus 4.5 / 4.6
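The pass/fail scoring used by SWE-bench-style benchmarks can be sketched roughly as below. The directory layout, patch format, and test command are illustrative assumptions; the official harness pins exact repository commits and runs everything inside Docker containers.

```python
import subprocess

def evaluate_patch(repo_dir: str, patch_file: str, test_cmd: list[str]) -> bool:
    """Apply a model-generated patch and re-run the repository's tests.

    Returns True only if the patch applies cleanly AND the designated
    fail-to-pass tests succeed afterwards. All names here are placeholders.
    """
    # Apply the candidate patch; a malformed or non-applying diff is a failure.
    applied = subprocess.run(
        ["git", "apply", patch_file], cwd=repo_dir, capture_output=True
    )
    if applied.returncode != 0:
        return False
    # Run the issue's test command (e.g. ["pytest", "tests/test_bug.py"]).
    result = subprocess.run(test_cmd, cwd=repo_dir, capture_output=True)
    return result.returncode == 0
```

The resolved rate reported on leaderboards is simply the fraction of tasks for which this check returns True.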
SWE-bench Pro
57.7%
top score

Successor to SWE-bench Verified with 1,865 uncontaminated tasks across Python, Go, TypeScript, and JavaScript in 41 actively maintained repositories. Tasks require fixing bugs or implementing features without breaking existing tests.

1,865 tasks · Best: GPT-5.4
Terminal-Bench Hard
57.6%
top score

Agentic benchmark from Stanford and the Laude Institute evaluating AI in terminal environments. Tasks include compiling code, training models, configuring servers, and debugging systems in Docker containers.

323 tasks · Best: GPT-5.4
LiveCodeBench Medium-Hard
91.7%
top score

Contamination-free coding benchmark that continuously harvests fresh competitive programming problems from LeetCode, AtCoder, and CodeForces. Tests code generation, self-repair, and execution prediction.

500 tasks · Best: Gemini 3 Pro
WebArena
~62%
top score

Web navigation benchmark with 812 tasks across 5 self-hosted websites (shopping, Reddit, GitLab, OpenStreetMap, Wikipedia). Agents must complete realistic multi-step web tasks.

812 tasks · Best: IBM CUGA
GAIA Medium-Hard
74.5%
top score

General AI Assistant benchmark. 466 questions requiring web search, file processing, and multi-step reasoning across 3 difficulty levels.

466 tasks · Best: HAL Agent (Sonnet 4.5)
GDPval-AA
1667 ELO
top score

Real-world agentic work tasks across 44 occupations and 9 industries. Models get shell access and web browsing to produce documents, slides, diagrams, and spreadsheets. Scored via blind pairwise ELO comparisons.

220 tasks · Best: GPT-5.4
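The blind pairwise ELO scoring described above can be sketched as follows. The K-factor of 32 and the 1500 starting rating are conventional chess defaults used here for illustration, not GDPval-AA's actual parameters.

```python
def expected(r_a: float, r_b: float) -> float:
    """Expected win probability of A against B under the ELO model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict[str, float], a: str, b: str,
           a_won: bool, k: float = 32.0) -> None:
    """Update two systems' ratings after one blind pairwise comparison."""
    ra, rb = ratings[a], ratings[b]
    score_a = 1.0 if a_won else 0.0
    ratings[a] = ra + k * (score_a - expected(ra, rb))
    ratings[b] = rb + k * ((1.0 - score_a) - expected(rb, ra))

# Two systems at equal ratings: the winner gains k * 0.5 = 16 points.
ratings = {"model_A": 1500.0, "model_B": 1500.0}
update(ratings, "model_A", "model_B", a_won=True)
```

Iterating this over many graded task pairs converges to a ranking; unlike raw accuracy, ELO has no ceiling, so it keeps discriminating between systems even after easy tasks saturate.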
τ²-bench
98.8%
top score

Dual-control conversational AI benchmark from Sierra Research simulating technical support scenarios. Both agent and user modify shared state, testing problem-solving and communication in the telecom domain.

327 tasks · Best: GLM-4.7-Flash
TheAgentCompany
~30%
top score

Workplace simulation from CMU with 175 tasks in a simulated software company. Agents browse the web, write code, run programs, and communicate with simulated coworkers via Slack, GitLab, and more.

175 tasks · Best: Various

Evaluation Methodology

Rigorous benchmark evaluation requires deterministic execution (temperature=0), isolated environments, and statistical confidence intervals. A common error is reporting accuracy without confidence bounds — on small benchmarks like TheAgentCompany (175 tasks), a 2-point difference may not be statistically significant.
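To make the significance point concrete, a standard pooled two-proportion z-test (a generic statistical check, not something the benchmarks themselves prescribe) shows that 30% vs 32% on 175 tasks is well within noise:

```python
from math import sqrt

def two_proportion_z(p1: float, p2: float, n1: int, n2: int) -> float:
    """Two-proportion z statistic using a pooled standard error."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# 32% vs 30% on two 175-task runs: |z| ≈ 0.40, far below the 1.96
# threshold for significance at the 95% level.
z = two_proportion_z(0.32, 0.30, 175, 175)
```

On a benchmark this size, roughly a 10-point gap would be needed before a difference clears the 95% threshold.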

Wilson Score Intervals
Always report 95% confidence intervals alongside benchmark scores. The Wilson score interval is preferred over the normal approximation for proportions near 0 or 1.
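A minimal implementation of the Wilson score interval (the formula is standard; the example numbers are illustrative):

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score confidence interval for a binomial proportion.

    z = 1.96 gives a 95% interval. Unlike the normal approximation,
    the bounds never fall outside [0, 1], even for proportions near 0 or 1.
    """
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - margin, center + margin

# e.g. 52/175 resolved (~30%) yields roughly a ±7-point interval,
# illustrating why small score gaps on small benchmarks are inconclusive.
lo, hi = wilson_interval(52, 175)
```

On a 500-task benchmark like SWE-bench Verified the same calculation tightens to about ±4 points, which is still wide enough to blur many leaderboard gaps.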