
Evaluation & Metrics

Measuring agent performance across component accuracy, task completion, trajectory quality, and system-level metrics with benchmarks and LLM-as-judge.


February 18, 2026

How do you know if your agent is actually working? Shipping an agent without systematic evaluation means flying blind: you cannot detect regressions, cannot compare approaches, and cannot build confidence that the system will behave as expected on real-world inputs. Evaluation is the discipline of measuring agent performance across multiple dimensions — component accuracy, task completion, trajectory quality, and system-level concerns like cost and safety — so that improvements are grounded in evidence rather than intuition.

Evaluation Taxonomy

Agent evaluation operates at three layers. Component-level metrics measure individual capabilities such as tool calling accuracy and argument extraction. Task-level metrics measure goal achievement — did the agent complete the assigned task, and how efficiently? System-level metrics measure real-world deployment concerns: latency, cost, safety compliance, and user preference. Strong component metrics are necessary but not sufficient for good task metrics, which are necessary but not sufficient for good system metrics.

┌─────────────────────────────────────────────────────────────────────────────┐
│                          Agent Evaluation Layers                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  Layer 3: SYSTEM                                                     │   │
│  │  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐    │   │
│  │  │ E2E Latency │ │ Cost/Task   │ │ Safety      │ │ User Pref   │    │   │
│  │  │ P95 < 30s   │ │ $/query     │ │ Compliance  │ │ Win Rate    │    │   │
│  │  └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘    │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                    ↑                                        │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  Layer 2: TASK                                                       │   │
│  │  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐    │   │
│  │  │ Completion  │ │ Step        │ │ Error       │ │ Output      │    │   │
│  │  │ Rate        │ │ Efficiency  │ │ Recovery    │ │ Quality     │    │   │
│  │  └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘    │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                    ↑                                        │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  Layer 1: COMPONENT                                                  │   │
│  │  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐    │   │
│  │  │ Tool Call   │ │ Argument    │ │ Response    │ │ Context     │    │   │
│  │  │ Accuracy    │ │ Extraction  │ │ Format      │ │ Utilization │    │   │
│  │  └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘    │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
Evaluation metrics organized by layer
| Layer     | Metric                   | Description                   | Evaluation Method                      |
|-----------|--------------------------|-------------------------------|----------------------------------------|
| Component | Tool Calling Accuracy    | Correct tool selection rate   | Compare to ground truth tool sequence  |
| Component | Argument Extraction F1   | Parameter parsing accuracy    | Compare extracted args to expected     |
| Component | Response Format Validity | Structured output correctness | JSON schema validation                 |
| Task      | Task Completion Rate     | Goal achievement percentage   | Binary pass/fail per task              |
| Task      | Step Efficiency          | Steps taken vs. optimal path  | Ratio of actual to optimal steps       |
| Task      | Error Recovery Rate      | Recovery from failures        | Track retry success rate               |
| System    | End-to-End Latency       | Total response time           | Measure P50, P95, P99                  |
| System    | Cost per Task            | Resource consumption          | Tokens / API calls / dollars           |
| System    | Safety Compliance        | Guardrail adherence           | Red team testing                       |
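
To make the component layer concrete, the sketch below computes tool-calling accuracy and argument-extraction F1 against a ground-truth tool sequence. The ToolStep record and the example traces are hypothetical stand-ins for whatever trace format your agent framework emits.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ToolStep:
    """Hypothetical record of a single tool invocation in an agent trace."""
    name: str
    args: Dict[str, str] = field(default_factory=dict)

def tool_calling_accuracy(actual: List[ToolStep], expected: List[ToolStep]) -> float:
    """Fraction of expected tool calls whose name matches at the same position."""
    if not expected:
        return 1.0
    matches = sum(1 for a, e in zip(actual, expected) if a.name == e.name)
    return matches / len(expected)

def argument_extraction_f1(actual: List[ToolStep], expected: List[ToolStep]) -> float:
    """Micro-averaged F1 over (tool, argument, value) triples across the trace."""
    actual_triples = {(s.name, k, v) for s in actual for k, v in s.args.items()}
    expected_triples = {(s.name, k, v) for s in expected for k, v in s.args.items()}
    if not actual_triples and not expected_triples:
        return 1.0
    tp = len(actual_triples & expected_triples)
    precision = tp / len(actual_triples) if actual_triples else 0.0
    recall = tp / len(expected_triples) if expected_triples else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: the agent picked the right tools but extracted one argument wrong
expected = [ToolStep("get_weather", {"city": "London"}),
            ToolStep("search_restaurants", {"location": "London"})]
actual = [ToolStep("get_weather", {"city": "London"}),
          ToolStep("search_restaurants", {"location": "Londn"})]
print(tool_calling_accuracy(actual, expected))   # 1.0
print(argument_extraction_f1(actual, expected))  # 0.5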

Agent-Specific Metrics with DeepEval

DeepEval is an open-source evaluation framework with metrics specifically designed for agentic systems. Unlike traditional NLP metrics (BLEU, ROUGE) that measure surface-level text similarity, DeepEval captures tool use, grounding, and multi-step reasoning quality.

from deepeval import evaluate
from deepeval.metrics import (
    ToolCorrectnessMetric,
    FaithfulnessMetric,
    AnswerRelevancyMetric,
    TaskCompletionMetric,
)
from deepeval.test_case import LLMTestCase, ToolCall

# Define a test case for agent evaluation.
# Note: field and parameter names below follow recent DeepEval releases
# (e.g. ToolCall.input_parameters); older versions may differ slightly.
test_case = LLMTestCase(
    input="Find the weather in London and book a restaurant nearby",
    actual_output=(
        "The weather in London is 15°C with light rain. I found 'The Ivy' "
        "restaurant nearby and booked a table for 7pm."
    ),
    expected_output="Weather retrieved and restaurant booked successfully",
    retrieval_context=[
        "London weather: 15°C, light rain, humidity 80%",
        "Nearby restaurants: The Ivy (0.3mi), Sketch (0.5mi)"
    ],
    tools_called=[
        ToolCall(name="get_weather", input_parameters={"city": "London"}),
        ToolCall(name="search_restaurants", input_parameters={"location": "London", "cuisine": "any"}),
        ToolCall(name="book_restaurant", input_parameters={"restaurant": "The Ivy", "time": "19:00"})
    ],
    expected_tools=[
        ToolCall(name="get_weather", input_parameters={"city": "London"}),
        ToolCall(name="search_restaurants", input_parameters={"location": "London"}),
        ToolCall(name="book_restaurant")  # Arguments are allowed to vary
    ]
)

# Define metrics
metrics = [
    ToolCorrectnessMetric(
        threshold=0.8  # argument-level checks are configurable; see the docs for your version
    ),
    FaithfulnessMetric(
        threshold=0.7,
        model="gpt-4"  # Judge model
    ),
    AnswerRelevancyMetric(
        threshold=0.7,
        model="gpt-4"
    ),
    TaskCompletionMetric(
        threshold=0.8,
        model="gpt-4"
    )
]

# Run evaluation
results = evaluate(test_cases=[test_case], metrics=metrics)

# Recent DeepEval versions return an object with per-test-case results;
# each test result carries the scored metrics and the judge's reasoning.
for test_result in results.test_results:
    for metric_data in test_result.metrics_data:
        print(f"{metric_data.name}: {metric_data.score:.2f}")
        if metric_data.reason:
            print(f"  Reason: {metric_data.reason}")

The four core agent metrics are: Tool Correctness (did the agent call the right tools with the right arguments), Faithfulness (is the response grounded in retrieved context rather than hallucinated), Answer Relevancy (does the response actually address the user’s question), and Task Completion (did the agent achieve the stated goal, assessed by an LLM judge for nuance beyond binary pass/fail).

Trajectory-Level Evaluation

Beyond final outputs, trajectory evaluation examines the agent’s reasoning path. An agent might reach the correct answer through a convoluted path, wasting tokens and time, or it might succeed on the test case but for the wrong reasons — lucky guesses that will fail on similar inputs.

Why Trajectory Evaluation?

Trajectory metrics reveal inefficiencies that final-answer metrics miss, such as redundant tool calls and circular reasoning, and they catch loops and regressions before they reach production.

from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional

@dataclass
class TrajectoryStep:
    state: Dict[str, Any]
    action: str
    observation: str
    reasoning: str
    is_error: bool = False

@dataclass
class AgentTrajectory:
    steps: List[TrajectoryStep]
    final_result: Any
    task_completed: bool

class TrajectoryEvaluator:
    def __init__(self, llm_judge: Optional[Callable[[str], float]] = None):
        # Optional callable that scores a reasoning trace in [0, 1]
        self.llm_judge = llm_judge

    def evaluate(self,
                 trajectory: AgentTrajectory,
                 optimal_steps: Optional[int] = None) -> Dict[str, float]:
        metrics: Dict[str, float] = {}
        actions = [s.action for s in trajectory.steps]

        # 1. Step efficiency: how close the trajectory is to the known optimal path
        if optimal_steps:
            metrics["step_efficiency"] = min(
                1.0, optimal_steps / len(trajectory.steps)
            )

        # 2. Action diversity (low values suggest the agent is stuck in a loop)
        metrics["action_diversity"] = (
            len(set(actions)) / len(actions) if actions else 1.0
        )

        # 3. Repetition: penalize exact repeated action sequences
        metrics["repetition_score"] = self._detect_repetition(actions)

        # 4. Error recovery rate
        errors = [s for s in trajectory.steps if s.is_error]
        if errors:
            recoveries = self._count_recoveries(trajectory)
            metrics["recovery_rate"] = recoveries / len(errors)
        else:
            metrics["recovery_rate"] = 1.0  # No errors = nothing to recover from

        # 5. LLM-as-judge for reasoning quality
        if self.llm_judge:
            metrics["reasoning_quality"] = self._judge_reasoning(trajectory)

        return metrics

    def _detect_repetition(self, actions: List[str]) -> float:
        """Return 1.0 if no repetition, lower if action patterns repeat back-to-back."""
        if len(actions) < 4:
            return 1.0
        for pattern_len in (2, 3):
            for i in range(len(actions) - pattern_len * 2 + 1):
                pattern = actions[i:i + pattern_len]
                next_seq = actions[i + pattern_len:i + pattern_len * 2]
                if pattern == next_seq:
                    return 0.5  # Repetition detected
        return 1.0

    def _count_recoveries(self, trajectory: AgentTrajectory) -> int:
        """Count error steps that are followed by at least one successful step."""
        recoveries = 0
        for idx, step in enumerate(trajectory.steps):
            if step.is_error and any(
                not later.is_error for later in trajectory.steps[idx + 1:]
            ):
                recoveries += 1
        return recoveries

    def _judge_reasoning(self, trajectory: AgentTrajectory) -> float:
        """Delegate reasoning-quality scoring to the injected LLM judge."""
        trace = "\n".join(
            f"Step {i}: {s.reasoning} -> {s.action}"
            for i, s in enumerate(trajectory.steps, start=1)
        )
        return self.llm_judge(trace)

Key trajectory metrics include step efficiency (actual steps versus the known optimal path), action diversity (penalizing repetitive actions that indicate the agent is stuck in a loop), progress rate (how much progress toward the goal each step achieves), error recovery rate (how often the agent successfully recovers after a tool failure or wrong action), and LLM-judged reasoning quality.
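
A minimal usage sketch of the evaluator above, on a hypothetical four-step trajectory that contains one recovered tool failure:

# Hypothetical trajectory: one failed search that the agent retried successfully
steps = [
    TrajectoryStep(state={}, action="search_flights",
                   observation="timeout", reasoning="Need flight options", is_error=True),
    TrajectoryStep(state={}, action="search_flights",
                   observation="3 flights found", reasoning="Retry after timeout"),
    TrajectoryStep(state={}, action="compare_prices",
                   observation="cheapest: $420", reasoning="Pick the cheapest option"),
    TrajectoryStep(state={}, action="book_flight",
                   observation="booking confirmed", reasoning="Book the selected flight"),
]
trajectory = AgentTrajectory(steps=steps, final_result="booked", task_completed=True)

evaluator = TrajectoryEvaluator()  # no LLM judge: reasoning_quality is skipped
print(evaluator.evaluate(trajectory, optimal_steps=3))
# {'step_efficiency': 0.75, 'action_diversity': 0.75,
#  'repetition_score': 1.0, 'recovery_rate': 1.0}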

Major Agent Benchmarks

Benchmarks provide standardized tasks for comparing agent capabilities across research groups and over time.

Major agent benchmarks comparison
| Benchmark          | Domain               | Tasks | Primary Metric    |
|--------------------|----------------------|-------|-------------------|
| SWE-bench Verified | Software Engineering | 500   | Resolved Rate     |
| SWE-bench Full     | Software Engineering | 2,294 | Resolved Rate     |
| WebArena           | Web Navigation       | 812   | Task Success Rate |
| GAIA Level 1       | General Assistant    | ~165  | Exact Match       |
| GAIA Level 2       | General Assistant    | ~186  | Exact Match       |
| GAIA Level 3       | General Assistant    | ~115  | Exact Match       |
| τ-bench            | Tool + Conversation  | 680   | Pass Rate         |
| HumanEval          | Code Generation      | 164   | Pass@1            |

SWE-bench evaluates agents on real GitHub issues from popular Python repositories like Django, Flask, and scikit-learn. The agent must understand the issue, locate relevant code, and generate a patch that passes the repository’s test suite. SWE-bench Verified uses 500 human-verified issues; SWE-bench Lite contains 300 simpler issues for faster iteration. WebArena tests realistic web navigation across five self-hosted websites including shopping, forums, code hosting, and maps. GAIA (General AI Assistant) poses questions requiring multi-step reasoning across web search, file processing, and calculation, at three difficulty levels. τ-bench evaluates multi-turn conversations requiring tool use in retail and airline customer-service domains, pairing simulated APIs with a simulated user.
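
For Pass@k-scored benchmarks such as HumanEval, the standard unbiased estimator (introduced with the Codex evaluation) reduces variance by generating n >= k samples per problem, counting the c that pass the tests, and computing pass@k = 1 - C(n-c, k) / C(n, k). A short sketch:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n generations is correct, given c correct."""
    if n - c < k:
        return 1.0  # every possible draw of k contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 37 pass the unit tests
print(pass_at_k(n=200, c=37, k=1))   # 0.185 (exactly 1 - 163/200)
print(pass_at_k(n=200, c=37, k=10))  # ~0.88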

LLM-as-Judge

Many agent behaviors resist deterministic evaluation. Response helpfulness, reasoning coherence, and instruction-following quality are inherently subjective. LLM-as-Judge uses a capable model to evaluate agent outputs against a rubric, providing more nuanced assessment than exact-match metrics.

┌──────────────────────────────────────────────────────────────────┐
│                      LLM-as-Judge Pipeline                        │
├──────────────────────────────────────────────────────────────────┤
│                                                                   │
│   ┌─────────────┐    ┌──────────────┐    ┌─────────────────┐     │
│   │ Agent       │    │ Judge        │    │ Structured      │     │
│   │ Output      │───▶│ Prompt       │───▶│ Evaluation      │     │
│   │             │    │ + Rubric     │    │                 │     │
│   └─────────────┘    └──────────────┘    │ • Score: 0-1    │     │
│                                          │ • Reasoning     │     │
│   ┌─────────────┐                        │ • Suggestions   │     │
│   │ Ground      │                        └─────────────────┘     │
│   │ Truth       │────────▶ Compare                               │
│   │ (optional)  │                                                │
│   └─────────────┘                                                │
│                                                                   │
│   Judge Models: any capable frontier model                        │
└──────────────────────────────────────────────────────────────────┘
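
A minimal sketch of this pipeline, assuming an OpenAI-style chat client; the rubric text, the judge_output helper, and the JSON response format are illustrative choices, not a prescribed API.

import json
from openai import OpenAI  # assumes an OpenAI-style chat client; any capable judge model works

client = OpenAI()

JUDGE_RUBRIC = """You are evaluating an AI agent's response.
Score each criterion from 0.0 to 1.0:
- helpfulness: does the response fully address the user's request?
- groundedness: is every claim supported by the provided context?
- instruction_following: does it respect all constraints in the task?
Return JSON: {"helpfulness": ..., "groundedness": ...,
"instruction_following": ..., "reasoning": "..."}"""

def judge_output(task: str, agent_output: str, context: str = "") -> dict:
    """Ask a judge model to score an agent output against the rubric."""
    response = client.chat.completions.create(
        model="gpt-4o",  # any capable frontier model can act as judge
        temperature=0,   # keep scoring as deterministic as possible
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": JUDGE_RUBRIC},
            {"role": "user", "content": (
                f"Task:\n{task}\n\nContext:\n{context}\n\n"
                f"Agent output:\n{agent_output}"
            )},
        ],
    )
    return json.loads(response.choices[0].message.content)

scores = judge_output(
    task="Find the weather in London and book a restaurant nearby",
    agent_output="The weather in London is 15°C with light rain. "
                 "I booked The Ivy for 7pm.",
    context="London weather: 15°C, light rain. Nearby: The Ivy, Sketch.",
)
print(scores["helpfulness"], scores["reasoning"])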

Evaluation Best Practices

Evaluate at multiple layers rather than relying only on final answer accuracy. Use held-out test sets that were never seen during development. Include adversarial and edge cases — the cases that reveal failure modes, not just average performance. Track metrics over time to detect regressions before they reach production. Combine automated metrics with periodic human evaluation to catch what automated metrics miss. Report confidence intervals, not just point estimates, because many benchmark datasets are small enough that single-run scores have high variance.
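
For the confidence-interval point in particular, a percentile bootstrap over per-task pass/fail outcomes shows how much a completion-rate estimate can move on a small evaluation set. The sketch below assumes a plain list of binary task results.

import random
from typing import List, Tuple

def bootstrap_ci(outcomes: List[int],
                 n_resamples: int = 10_000,
                 confidence: float = 0.95,
                 seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI for a task completion rate (1 = pass, 0 = fail)."""
    rng = random.Random(seed)
    n = len(outcomes)
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo = rates[int((1 - confidence) / 2 * n_resamples)]
    hi = rates[int((1 + confidence) / 2 * n_resamples) - 1]
    return lo, hi

# 60 tasks, 42 passed: the point estimate is 0.70, but the interval is wide
outcomes = [1] * 42 + [0] * 18
print(bootstrap_ci(outcomes))  # roughly (0.58, 0.82) on a 60-task set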

Do not overfit to benchmark-specific patterns, ignore trajectory quality, skip safety evaluation, use only synthetic test cases, or assume benchmark scores predict production performance. Benchmarks measure a proxy for real-world capability; the gap between benchmark performance and deployment performance is often substantial and domain-specific.

Tags: evaluation, metrics, benchmarks, llm-judge