Evaluation & Metrics
Measuring agent performance across component accuracy, task completion, trajectory quality, and system-level metrics with benchmarks and LLM-as-judge.
How do you know if your agent is actually working? Shipping an agent without systematic evaluation means flying blind: you cannot detect regressions, cannot compare approaches, and cannot build confidence that the system will behave as expected on real-world inputs. Evaluation is the discipline of measuring agent performance across multiple dimensions — component accuracy, task completion, trajectory quality, and system-level concerns like cost and safety — so that improvements are grounded in evidence rather than intuition.
Evaluation Taxonomy
Agent evaluation operates at three layers. Component-level metrics measure individual capabilities such as tool calling accuracy and argument extraction. Task-level metrics measure goal achievement — did the agent complete the assigned task, and how efficiently? System-level metrics measure real-world deployment concerns: latency, cost, safety compliance, and user preference. Strong component metrics are necessary but not sufficient for good task metrics, which are necessary but not sufficient for good system metrics.
┌───────────────────────────────────────────────────────────────────────────┐
│                          Agent Evaluation Layers                          │
├───────────────────────────────────────────────────────────────────────────┤
│                                                                           │
│   ┌───────────────────────────────────────────────────────────────────┐   │
│   │ Layer 3: SYSTEM                                                   │   │
│   │  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐  │   │
│   │  │ E2E Latency │ │ Cost/Task   │ │ Safety      │ │ User Pref   │  │   │
│   │  │ P95 < 30s   │ │ $/query     │ │ Compliance  │ │ Win Rate    │  │   │
│   │  └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘  │   │
│   └───────────────────────────────────────────────────────────────────┘   │
│                                     ↑                                     │
│   ┌───────────────────────────────────────────────────────────────────┐   │
│   │ Layer 2: TASK                                                     │   │
│   │  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐  │   │
│   │  │ Completion  │ │ Step        │ │ Error       │ │ Output      │  │   │
│   │  │ Rate        │ │ Efficiency  │ │ Recovery    │ │ Quality     │  │   │
│   │  └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘  │   │
│   └───────────────────────────────────────────────────────────────────┘   │
│                                     ↑                                     │
│   ┌───────────────────────────────────────────────────────────────────┐   │
│   │ Layer 1: COMPONENT                                                │   │
│   │  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐  │   │
│   │  │ Tool Call   │ │ Argument    │ │ Response    │ │ Context     │  │   │
│   │  │ Accuracy    │ │ Extraction  │ │ Format      │ │ Utilization │  │   │
│   │  └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘  │   │
│   └───────────────────────────────────────────────────────────────────┘   │
│                                                                           │
└───────────────────────────────────────────────────────────────────────────┘
| Layer | Metric | Description | Evaluation Method |
|---|---|---|---|
| Component | Tool Calling Accuracy | Correct tool selection rate | Compare to ground truth tool sequence |
| Component | Argument Extraction F1 | Parameter parsing accuracy | Compare extracted args to expected |
| Component | Response Format Validity | Structured output correctness | JSON schema validation |
| Task | Task Completion Rate | Goal achievement percentage | Binary pass/fail per task |
| Task | Step Efficiency | Steps taken vs optimal path | Ratio of actual to optimal steps |
| Task | Error Recovery Rate | Recovery from failures | Track retry success rate |
| System | End-to-End Latency | Total response time | Measure P50, P95, P99 |
| System | Cost per Task | Resource consumption | Tokens/API calls/dollars |
| System | Safety Compliance | Guardrail adherence | Red team testing |
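To make the table concrete, here is a minimal sketch of how the three layers might be computed from logged agent runs. The record fields (expected_tool, called_tool, completed, steps, optimal_steps, latency_s, cost_usd) are hypothetical placeholders for whatever your tracing layer actually captures.

# Hypothetical run records; the field names are illustrative, not from any specific framework.
runs = [
    {"expected_tool": "get_weather", "called_tool": "get_weather",
     "completed": True, "steps": 4, "optimal_steps": 3,
     "latency_s": 12.1, "cost_usd": 0.04},
    {"expected_tool": "search_docs", "called_tool": "get_weather",
     "completed": False, "steps": 9, "optimal_steps": 4,
     "latency_s": 41.7, "cost_usd": 0.11},
]

# Layer 1 (component): tool selection accuracy
tool_accuracy = sum(r["called_tool"] == r["expected_tool"] for r in runs) / len(runs)

# Layer 2 (task): completion rate and step efficiency
completion_rate = sum(r["completed"] for r in runs) / len(runs)
step_efficiency = sum(min(1.0, r["optimal_steps"] / r["steps"]) for r in runs) / len(runs)

# Layer 3 (system): crude P95 latency and average cost per task
latencies = sorted(r["latency_s"] for r in runs)
p95_latency = latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))]
cost_per_task = sum(r["cost_usd"] for r in runs) / len(runs)

print(tool_accuracy, completion_rate, step_efficiency, p95_latency, cost_per_task)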
Agent-Specific Metrics with DeepEval
DeepEval is an open-source evaluation framework with metrics specifically designed for agentic systems. Unlike traditional NLP metrics (BLEU, ROUGE) that measure surface-level text similarity, DeepEval captures tool use, grounding, and multi-step reasoning quality.
from deepeval import evaluate
from deepeval.metrics import (
ToolCorrectnessMetric,
FaithfulnessMetric,
AnswerRelevancyMetric,
TaskCompletionMetric,
)
from deepeval.test_case import LLMTestCase, ToolCall
# Define test case for agent evaluation
test_case = LLMTestCase(
input="Find the weather in London and book a restaurant nearby",
actual_output="The weather in London is 15°C with light rain. I found 'The Ivy' restaurant nearby and booked a table for 7pm.",
expected_output="Weather retrieved and restaurant booked successfully",
retrieval_context=[
"London weather: 15°C, light rain, humidity 80%",
"Nearby restaurants: The Ivy (0.3mi), Sketch (0.5mi)"
],
tools_called=[
ToolCall(name="get_weather", args={"city": "London"}),
ToolCall(name="search_restaurants", args={"location": "London", "cuisine": "any"}),
ToolCall(name="book_restaurant", args={"restaurant": "The Ivy", "time": "19:00"})
],
expected_tools=[
ToolCall(name="get_weather", args={"city": "London"}),
ToolCall(name="search_restaurants", args={"location": "London"}),
ToolCall(name="book_restaurant", args={}) # Args can vary
]
)
# Define metrics
metrics = [
ToolCorrectnessMetric(
threshold=0.8,
include_args=True # Also check arguments
),
FaithfulnessMetric(
threshold=0.7,
model="gpt-4" # Judge model
),
AnswerRelevancyMetric(
threshold=0.7,
model="gpt-4"
),
TaskCompletionMetric(
threshold=0.8,
model="gpt-4"
)
]
# Run evaluation (the result object layout may differ slightly across DeepEval versions)
results = evaluate(test_cases=[test_case], metrics=metrics)
for test_result in results.test_results:
    for metric_data in test_result.metrics_data:
        print(f"{metric_data.name}: {metric_data.score:.2f}")
        if metric_data.reason:
            print(f"  Reason: {metric_data.reason}")
The four core agent metrics are: Tool Correctness (did the agent call the right tools with the right arguments), Faithfulness (is the response grounded in retrieved context rather than hallucinated), Answer Relevancy (does the response actually address the user’s question), and Task Completion (did the agent achieve the stated goal, assessed by an LLM judge for nuance beyond binary pass/fail).
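The same metrics can also gate a CI pipeline so regressions fail a build rather than reaching users. The sketch below assumes DeepEval’s pytest-style assert_test helper; the run_agent function is hypothetical, and exact class fields and argument names may vary between DeepEval versions.

import pytest
from deepeval import assert_test
from deepeval.metrics import ToolCorrectnessMetric, TaskCompletionMetric
from deepeval.test_case import LLMTestCase, ToolCall

from my_agent import run_agent  # hypothetical: returns (output_text, tools_called)


@pytest.mark.parametrize("query,expected_tools", [
    (
        "Find the weather in London and book a restaurant nearby",
        [ToolCall(name="get_weather"), ToolCall(name="search_restaurants"),
         ToolCall(name="book_restaurant")],
    ),
])
def test_agent_tool_use(query, expected_tools):
    output, tools_called = run_agent(query)  # hypothetical return shape
    test_case = LLMTestCase(
        input=query,
        actual_output=output,
        tools_called=tools_called,
        expected_tools=expected_tools,
    )
    # assert_test raises if any metric score falls below its threshold,
    # so a tool-selection regression fails the build instead of shipping silently.
    assert_test(test_case, [
        ToolCorrectnessMetric(threshold=0.8),
        TaskCompletionMetric(threshold=0.8),
    ])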
Trajectory-Level Evaluation
Beyond final outputs, trajectory evaluation examines the agent’s reasoning path. An agent might reach the correct answer through a convoluted path, wasting tokens and time, or it might succeed on the test case for the wrong reasons: lucky guesses that will fail on similar inputs. Trajectory metrics reveal the inefficiencies that final-answer metrics miss and catch loops and regressions before they appear in production.
from dataclasses import dataclass
from typing import Any, Dict, List, Optional
@dataclass
class TrajectoryStep:
state: Dict[str, Any]
action: str
observation: str
reasoning: str
is_error: bool = False
@dataclass
class AgentTrajectory:
steps: List[TrajectoryStep]
final_result: Any
task_completed: bool
class TrajectoryEvaluator:
def __init__(self, llm_judge=None):
self.llm_judge = llm_judge
    def evaluate(self,
                 trajectory: AgentTrajectory,
                 optimal_steps: Optional[int] = None) -> Dict[str, float]:
metrics = {}
# 1. Step efficiency
if optimal_steps:
metrics["step_efficiency"] = min(
1.0, optimal_steps / len(trajectory.steps)
)
# 2. Action diversity (detect loops)
actions = [s.action for s in trajectory.steps]
metrics["action_diversity"] = len(set(actions)) / len(actions)
# 3. Detect repetition (exact action sequences)
metrics["repetition_score"] = self._detect_repetition(actions)
# 4. Error recovery rate
errors = [s for s in trajectory.steps if s.is_error]
if errors:
recoveries = self._count_recoveries(trajectory, errors)
metrics["recovery_rate"] = recoveries / len(errors)
else:
metrics["recovery_rate"] = 1.0 # No errors = perfect
# 5. LLM-as-judge for reasoning quality
if self.llm_judge:
metrics["reasoning_quality"] = self._judge_reasoning(trajectory)
return metrics
    def _detect_repetition(self, actions: List[str]) -> float:
        """Return 1.0 if no repetition, lower if patterns repeat."""
        if len(actions) < 4:
            return 1.0
        for pattern_len in [2, 3]:
            for i in range(len(actions) - pattern_len * 2 + 1):
                pattern = actions[i:i + pattern_len]
                next_seq = actions[i + pattern_len:i + pattern_len * 2]
                if pattern == next_seq:
                    return 0.5  # Repetition detected
        return 1.0

    def _count_recoveries(self, trajectory: AgentTrajectory,
                          errors: List[TrajectoryStep]) -> int:
        """Count error steps that are followed by at least one successful step."""
        recoveries = 0
        for error in errors:
            idx = trajectory.steps.index(error)
            if any(not s.is_error for s in trajectory.steps[idx + 1:]):
                recoveries += 1
        return recoveries

    def _judge_reasoning(self, trajectory: AgentTrajectory) -> float:
        """Score reasoning quality in [0, 1] with the judge model.
        Assumes llm_judge is a callable that takes a prompt string and returns a number."""
        transcript = "\n".join(
            f"Step {i + 1}: {step.action} -> {step.observation}"
            for i, step in enumerate(trajectory.steps)
        )
        return float(self.llm_judge(
            "Rate the coherence and efficiency of this agent trajectory from 0 to 1. "
            "Respond with a single number.\n" + transcript
        ))
Key trajectory metrics include step efficiency (actual steps versus the known optimal path), action diversity (penalizing repetitive actions that indicate the agent is stuck in a loop), progress rate (how much progress toward the goal each step achieves), error recovery rate (how often the agent successfully recovers after a tool failure or wrong action), and LLM-judged reasoning quality.
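Putting the evaluator to work might look like the toy example below; the trajectory contents are invented purely for illustration.

# A toy trajectory: one failed tool call that the agent recovers from.
trajectory = AgentTrajectory(
    steps=[
        TrajectoryStep(state={}, action="search_flights", observation="timeout",
                       reasoning="Need flight options first", is_error=True),
        TrajectoryStep(state={}, action="search_flights", observation="3 results",
                       reasoning="Retry after timeout"),
        TrajectoryStep(state={}, action="book_flight", observation="confirmed",
                       reasoning="Cheapest option fits the constraints"),
    ],
    final_result="Flight booked",
    task_completed=True,
)

evaluator = TrajectoryEvaluator()  # no LLM judge, so reasoning_quality is skipped
scores = evaluator.evaluate(trajectory, optimal_steps=2)
print(scores)
# step_efficiency ≈ 0.67, action_diversity ≈ 0.67, repetition_score = 1.0, recovery_rate = 1.0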
Major Agent Benchmarks
Benchmarks provide standardized tasks for comparing agent capabilities across research groups and over time.
| Benchmark | Domain | Tasks | Primary Metric |
|---|---|---|---|
| SWE-bench Verified | Software Engineering | 500 | Resolved Rate |
| SWE-bench Full | Software Engineering | 2,294 | Resolved Rate |
| WebArena | Web Navigation | 812 | Task Success Rate |
| GAIA Level 1 | General Assistant | ~165 | Exact Match |
| GAIA Level 2 | General Assistant | ~186 | Exact Match |
| GAIA Level 3 | General Assistant | ~115 | Exact Match |
| τ-bench | Tool + Conversation | 680 | Pass Rate |
| HumanEval | Code Generation | 164 | Pass@1 |
SWE-bench evaluates agents on real GitHub issues from popular Python repositories like Django, Flask, and scikit-learn. The agent must understand the issue, locate relevant code, and generate a patch that passes the repository’s test suite. SWE-bench Verified uses 500 human-verified issues; SWE-bench Lite contains 300 simpler issues for faster iteration. WebArena tests realistic web navigation across five self-hosted websites including shopping, forums, code hosting, and maps. GAIA (General AI Assistant) poses questions requiring multi-step reasoning across web search, file processing, and calculation, at three difficulty levels. τ-bench evaluates multi-turn conversations requiring tool use in retail and airline domains, with simulated APIs and a simulated user.
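When reporting benchmark numbers, compute the metric the way the benchmark defines it. HumanEval’s pass@k, for instance, uses the unbiased estimator from the original paper rather than a naive average over k runs:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples generated per problem, c of them passed the tests.
    Estimates the probability that at least one of k randomly drawn samples passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples for one problem, 37 passed
print(pass_at_k(n=200, c=37, k=1))   # ≈ 0.185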
LLM-as-Judge
Many agent behaviors resist deterministic evaluation. Response helpfulness, reasoning coherence, and instruction-following quality are inherently subjective. LLM-as-Judge uses a capable model to evaluate agent outputs against a rubric, providing more nuanced assessment than exact-match metrics.
┌──────────────────────────────────────────────────────────────────┐
│                      LLM-as-Judge Pipeline                       │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌─────────────┐      ┌──────────────┐      ┌─────────────────┐  │
│  │   Agent     │      │    Judge     │      │   Structured    │  │
│  │   Output    │─────▶│    Prompt    │─────▶│   Evaluation    │  │
│  │             │      │   + Rubric   │      │                 │  │
│  └─────────────┘      └──────────────┘      │  • Score: 0-1   │  │
│                                             │  • Reasoning    │  │
│  ┌─────────────┐                            │  • Suggestions  │  │
│  │   Ground    │                            └─────────────────┘  │
│  │   Truth     │──────────▶ Compare                              │
│  │  (optional) │                                                 │
│  └─────────────┘                                                 │
│                                                                  │
│  Judge Models: any capable frontier model                        │
└──────────────────────────────────────────────────────────────────┘
LLM judges have known biases: they prefer verbose responses, may favor their own writing style, and can be fooled by confident-sounding but incorrect answers. Use multiple judges and calibrate against human labels before relying on LLM-as-judge scores.
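A minimal judge loop might look like the sketch below. The call_llm parameter and the rubric wording are placeholders rather than any particular SDK; the points that matter are the explicit rubric, the structured JSON output, and averaging more than one judge before trusting the score.

import json

JUDGE_RUBRIC = """You are evaluating an AI agent's response.
Score each criterion from 0 to 1:
- correctness: claims are supported by the provided context
- helpfulness: the response addresses the user's actual request
- completeness: no required step is missing
Return only JSON: {"correctness": x, "helpfulness": x, "completeness": x, "reasoning": "..."}"""


def judge_response(task: str, response: str, context: str, call_llm) -> dict:
    """call_llm is a placeholder for whatever chat-completion client you use."""
    prompt = f"{JUDGE_RUBRIC}\n\nTask: {task}\n\nContext:\n{context}\n\nAgent response:\n{response}"
    raw = call_llm(prompt)       # assumed to return the judge model's text completion
    scores = json.loads(raw)     # in production, validate against a schema and retry on parse errors
    scores["overall"] = sum(
        scores[k] for k in ("correctness", "helpfulness", "completeness")
    ) / 3
    return scores


# Mitigating single-judge bias: average two different judge models (call_gpt / call_claude
# are assumed callables) and periodically spot-check the scores against human labels.
# overall = (judge_response(t, r, c, call_gpt)["overall"]
#            + judge_response(t, r, c, call_claude)["overall"]) / 2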
Evaluation Best Practices
Evaluate at multiple layers rather than relying only on final answer accuracy. Use held-out test sets that were never seen during development. Include adversarial and edge cases — the cases that reveal failure modes, not just average performance. Track metrics over time to detect regressions before they reach production. Combine automated metrics with periodic human evaluation to catch what automated metrics miss. Report confidence intervals, not just point estimates, because many benchmark datasets are small enough that single-run scores have high variance.
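For the confidence-interval point, a nonparametric bootstrap over per-task pass/fail outcomes is usually sufficient; a small sketch:

import random

def bootstrap_ci(outcomes, n_resamples=10_000, alpha=0.05):
    """Point estimate and (1 - alpha) bootstrap CI for a mean success rate
    over binary per-task outcomes (1 = pass, 0 = fail)."""
    means = []
    for _ in range(n_resamples):
        sample = random.choices(outcomes, k=len(outcomes))
        means.append(sum(sample) / len(sample))
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(outcomes) / len(outcomes), (lower, upper)

# A 50-task benchmark with 31 passes: the 0.62 point estimate hides a wide interval.
outcomes = [1] * 31 + [0] * 19
print(bootstrap_ci(outcomes))   # roughly (0.62, (0.48, 0.76)); exact bounds vary by resample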
Do not overfit to benchmark-specific patterns, ignore trajectory quality, skip safety evaluation, use only synthetic test cases, or assume benchmark scores predict production performance. Benchmarks measure a proxy for real-world capability; the gap between benchmark performance and deployment performance is often substantial and domain-specific.