
Agent Benchmarks

Standardized measures of AI agent capability — from software engineering to web navigation.


┌─────────────────────────────────────────────────────────────────────────────┐
│                        Agent Benchmark Landscape                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  CODE & SOFTWARE                    WEB & NAVIGATION                        │
│  ┌────────────────────┐             ┌────────────────────┐                  │
│  │ SWE-bench          │             │ WebArena           │                  │
│  │ HumanEval          │             │ MiniWoB++          │                  │
│  │ MBPP               │             │ Mind2Web           │                  │
│  └────────────────────┘             └────────────────────┘                  │
│                                                                             │
│  GENERAL ASSISTANT                  TOOL & FUNCTION                         │
│  ┌────────────────────┐             ┌────────────────────┐                  │
│  │ GAIA               │             │ τ-bench            │                  │
│  │ AgentBench         │             │ API-Bank           │                  │
│  │ AssistantBench     │             │ ToolBench          │                  │
│  └────────────────────┘             └────────────────────┘                  │
│                                                                             │
│  Human Baseline ████████████████████████████████████████████ 78-95%        │
│  Current SOTA   ██████████████████████                       40-72%        │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Current State of the Art (2025)

Human vs AI Gap
The gap between human performance and AI agent performance remains significant on most benchmarks, especially those requiring long-horizon planning, error recovery, and real-world interaction.

Current state of the art vs human baseline:

Benchmark            Top Score   Leading System              Human Baseline
SWE-bench Verified   72%         Claude 3.5 Sonnet + Devin   ~100%
GAIA (L1)            75%         Various                     92%
WebArena             42%         GPT-4V + SoM                78%
HumanEval            95%+        Various                     ~100%

SWE-bench Verified
Top score: 72%

Gold standard for code agents. 500 human-verified GitHub issues from 12 Python repositories. The agent must generate patches that pass the repository's test suites.

500 tasks · Best: Claude 3.5 Sonnet + Devin

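The core check behind these scores is simple in outline: apply the agent's patch to the repository at the issue's base commit, then re-run the tests that originally failed. The sketch below is illustrative only, not the official harness (which evaluates each instance inside a dedicated Docker image with pinned dependencies); the function name, arguments, and the plain git-apply-plus-pytest flow are assumptions.

```python
import subprocess

def evaluate_patch(repo_dir: str, patch: str, fail_to_pass: list[str]) -> bool:
    """Apply a model-generated patch and re-run the issue's failing tests.

    Simplified stand-in for the SWE-bench harness, which runs each instance
    in an isolated container with a per-repository environment.
    """
    # Apply the unified diff produced by the agent (read from stdin).
    applied = subprocess.run(
        ["git", "apply", "--whitespace=nowarn", "-"],
        input=patch, text=True, cwd=repo_dir,
    )
    if applied.returncode != 0:
        return False  # patch does not apply cleanly -> instance unresolved

    # The instance counts as resolved only if every previously failing test now passes.
    tests = subprocess.run(["python", "-m", "pytest", "-x", *fail_to_pass], cwd=repo_dir)
    return tests.returncode == 0
```
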
SWE-bench (Full)
Top score: 51%

Complete dataset of 2,294 real GitHub issues. More diverse but noisier than the Verified subset.

2,294 tasks · Best: Various

WebArena
Top score: 42%

Web navigation benchmark with 812 tasks across five self-hosted websites (shopping, Reddit, GitLab, OpenStreetMap, Wikipedia).

812 tasks · Best: GPT-4V + SoM

GAIA (Medium-Hard)
Top score: 75% (L1)

General AI Assistant benchmark. 466 questions requiring web search, file processing, and multi-step reasoning.

466 tasks · Best: Various

τ-bench
Top score: 45% (Retail)

Multi-turn conversations requiring tool use across airline, retail, and banking domains with simulated APIs.

680 tasks · Best: GPT-4

AgentBench
Top score: model-dependent

Multi-dimensional evaluation across 8 environments: Operating System, Database, Knowledge Graph, Digital Card Game, Lateral Thinking Puzzles, House-Holding, Web Shopping, and Web Browsing.

1,632 tasks · Best: GPT-4

HumanEval
Top score: 95%+

Code generation benchmark. 164 Python programming problems testing functional correctness.

164 tasks · Best: Various

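HumanEval results are conventionally reported as pass@k, using the unbiased combinatorial estimator introduced alongside the benchmark: generate n samples per problem, count the c that pass the tests, and estimate the chance that at least one of k samples would pass. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples,
    drawn without replacement from n generated samples (c of which pass), is correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example with made-up counts: 37 of 200 samples pass -> pass@1 = 1 - 163/200 = 0.185
print(pass_at_k(n=200, c=37, k=1))
```
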
MBPP (Easy-Medium)
Top score: 85%+

Mostly Basic Python Problems. 974 entry-level Python programming tasks.

974 tasks · Best: Various

Evaluation Methodology

Rigorous benchmark evaluation requires deterministic execution (temperature=0), isolated environments, and statistical confidence intervals. A common error is reporting accuracy without confidence bounds — on small benchmarks like HumanEval (164 tasks), a 2-point difference may not be statistically significant.
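
For instance, an unpaired two-proportion z-test shows why a roughly 2-point gap on 164 tasks is inconclusive. The counts below are hypothetical, and when two systems are run on the same task set a paired test (e.g. McNemar's) is the stricter choice; this is only a quick normal-approximation check.

```python
from math import sqrt, erf

def two_proportion_ztest(k1: int, k2: int, n: int) -> tuple[float, float]:
    """Two-sided z-test for the gap between two pass counts on an
    n-task benchmark (unpaired, pooled normal approximation)."""
    p1, p2 = k1 / n, k2 / n
    pooled = (k1 + k2) / (2 * n)               # pooled pass rate under H0
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    z = (p1 - p2) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided tail
    return z, p_value

# Hypothetical scores: 140/164 (85.4%) vs 136/164 (82.9%), a ~2.4-point gap.
z, p = two_proportion_ztest(140, 136, 164)
print(f"z = {z:.2f}, p = {p:.2f}")             # about z = 0.60, p = 0.55 -> not significant
```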

Wilson Score Intervals
Always report 95% confidence intervals alongside benchmark scores. The Wilson score interval is preferred over the normal (Wald) approximation for small samples and for proportions near 0 or 1.
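
A minimal implementation of the interval, with a hypothetical 139/164 score to show how wide the bounds are at HumanEval's scale:

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score confidence interval for a benchmark pass rate
    (z = 1.96 gives a 95% interval)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# Hypothetical score of 139/164 (84.8%):
lo, hi = wilson_interval(139, 164)
print(f"95% CI: [{lo:.3f}, {hi:.3f}]")   # roughly [0.785, 0.895], about 11 points wide
```

An interval that wide is exactly why small score differences between systems should not be read as rankings without more samples or a paired analysis.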