danielhuber.dev@proton.me Sunday, April 5, 2026

Measuring Agent Reliability Beyond Accuracy

A framework of twelve metrics across four dimensions — consistency, robustness, predictability, and safety — for evaluating how AI agents actually behave in production.


An agent that scores 72% on a benchmark tells you almost nothing about whether you should deploy it. That number hides whether the agent fails on the same tasks every time or different ones at random, whether it degrades gracefully when a tool call times out or collapses entirely, and whether its failures are benign formatting errors or irreversible destructive actions. Reliability — the multi-dimensional property that safety-critical engineering has studied for decades — is what separates a capable prototype from a deployable system.

The gap between accuracy and reliability is not theoretical. A coding agent that deletes a production database, a customer service bot that gives different answers to the same question on consecutive runs, an order fulfillment system whose latency varies by an order of magnitude between identical requests — these are all reliability failures that accuracy scores cannot detect. Borrowing from the engineering practices of aviation, nuclear power, and automotive safety, we can decompose agent reliability into four measurable dimensions.

The Four Dimensions of Reliability

Safety-critical industries have converged on four properties that together define whether a system is trustworthy enough to deploy. These translate directly to AI agents:

Consistency asks whether the agent behaves the same way when run multiple times under identical conditions. Flight-critical software must execute deterministically; reactor protection systems must respond identically each time conditions warrant shutdown. For agents, consistency means examining not just whether outcomes match across runs, but whether the agent takes similar paths to reach them and whether it consumes similar resources each time.

Robustness asks whether performance degrades gracefully or collapses abruptly when conditions deviate from nominal. Automotive safety testing evaluates sensor failure responses; aviation qualification tests hardware against temperature extremes and electromagnetic interference. For agents, this means measuring sensitivity to tool failures, environment changes (reordered JSON fields, renamed API parameters), and semantically equivalent prompt reformulations.

Predictability asks whether the agent knows when it is likely to fail. Nuclear risk assessment models failure modes and quantifies their probabilities; aviation certification assigns probability targets by severity category. For agents, this means assessing whether confidence scores are calibrated — does an agent claiming 80% confidence actually succeed roughly 80% of the time?

Safety asks how severe the consequences are when failures occur. Not all failures are equal: returning results in the wrong sort order is benign, but executing an unintended DELETE statement is catastrophic. Safety-critical fields separate failure probability from failure consequence, and agent evaluation should do the same.

Twelve Concrete Metrics

Each dimension decomposes into specific, computable metrics that are independent of raw accuracy. This independence is critical — a highly capable agent can be unreliable, and a less capable agent can be highly reliable within its operating envelope.

Consistency Metrics

Outcome consistency measures whether the agent succeeds or fails consistently on repeated attempts at the same task, normalized by the maximum possible variance for a given success rate. Trajectory consistency captures whether the agent takes similar paths, measured both distributionally (do action type frequencies match?) and sequentially (does action ordering match?). Resource consistency quantifies variability in cost, latency, and API calls across identical requests.
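As one illustrative reading of the outcome-consistency metric (the paper's exact normalization may differ), the sketch below treats each task's per-run success variance as the quantity of interest and normalizes by the maximum variance achievable at the agent's overall success rate: a fully random agent scores 0, while an agent that always succeeds or always fails on each given task scores 1.

```python
def outcome_consistency(task_outcomes):
    """Outcome consistency across repeated runs, normalized by the
    maximum possible variance for the overall success rate.

    task_outcomes: list of per-task outcome lists over k runs,
    e.g. [[1, 1, 0, 1, 1], [0, 0, 0, 0, 0], ...].
    Returns 1.0 when every task is all-success or all-failure,
    0.0 when outcomes look fully random at the same success rate.
    """
    rates = [sum(runs) / len(runs) for runs in task_outcomes]
    p = sum(rates) / len(rates)  # overall success rate across tasks
    if p in (0.0, 1.0):
        return 1.0  # degenerate case: perfectly consistent
    observed = sum(q * (1 - q) for q in rates) / len(rates)  # mean within-task variance
    maximum = p * (1 - p)  # all variance within tasks, none between them
    return 1.0 - observed / maximum
```

An agent that deterministically solves half the tasks and deterministically fails the other half gets a score of 1.0; an agent that succeeds on every task exactly half the time gets 0.0, despite both having the same 50% accuracy.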

For robustness, three metrics cover distinct perturbation types: fault robustness measures resilience to infrastructure failures like API timeouts and malformed tool responses; environment robustness captures sensitivity to semantic-preserving changes in the operating environment; and prompt robustness measures invariance to equivalent reformulations of instructions.
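A simple, hedged way to operationalize all three robustness metrics (not necessarily the paper's exact formula) is as retained performance: the success rate under a perturbation divided by the nominal success rate, clipped to [0, 1]. The example numbers below are illustrative, not measured results.

```python
def robustness_score(nominal, perturbed):
    """Fraction of nominal performance retained under a perturbation.

    nominal, perturbed: success rates in [0, 1] on the same task set,
    measured without and with the perturbation applied.
    """
    if nominal == 0:
        return 1.0  # nothing to lose; vacuously robust
    return max(0.0, min(1.0, perturbed / nominal))

# One score per perturbation family (illustrative numbers):
r_fault = robustness_score(0.70, 0.63)   # injected API timeouts
r_env = robustness_score(0.70, 0.70)     # reordered JSON fields
r_prompt = robustness_score(0.70, 0.49)  # paraphrased instructions
```

The clip to [0, 1] means a perturbation that accidentally helps does not inflate the score; robustness here measures invariance, not improvement.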

Predictability uses three scoring-rule metrics: calibration (do stated confidence levels match empirical success rates?), discrimination (can confidence scores separate successes from failures?), and the Brier score as a proper scoring rule that jointly penalizes both miscalibration and poor discrimination.
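Two of these are standard and can be sketched directly: the Brier score is the mean squared error between stated confidence and binary outcome, and a common calibration measure is expected calibration error (ECE), the size-weighted gap between mean confidence and empirical success rate within confidence bins. (Whether the paper uses ECE specifically is an assumption; it is the conventional choice.)

```python
def brier_score(confidences, outcomes):
    """Mean squared error between confidence in [0, 1] and outcome in {0, 1}.
    Lower is better; penalizes both miscalibration and poor discrimination."""
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(outcomes)

def expected_calibration_error(confidences, outcomes, bins=10):
    """Size-weighted |mean confidence - empirical accuracy| over
    equal-width confidence bins. 0.0 means perfectly calibrated."""
    buckets = [[] for _ in range(bins)]
    for c, o in zip(confidences, outcomes):
        buckets[min(int(c * bins), bins - 1)].append((c, o))
    n = len(outcomes)
    ece = 0.0
    for bucket in buckets:
        if bucket:
            mean_conf = sum(c for c, _ in bucket) / len(bucket)
            accuracy = sum(o for _, o in bucket) / len(bucket)
            ece += (len(bucket) / n) * abs(mean_conf - accuracy)
    return ece
```

An agent that claims 80% confidence and succeeds on 4 of 5 such tasks contributes nothing to ECE; one that claims 80% but succeeds half the time contributes a 0.3 gap weighted by how often it makes that claim.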

Safety separates compliance (does the agent respect operational constraints like avoiding PII exposure?) from harm severity (when constraints are violated, how bad are the consequences?). The overall safety score follows the classical risk formulation: risk equals the product of violation probability and expected severity.
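The classical risk formulation stated above is small enough to write down directly; the episode representation here is an assumption made for illustration.

```python
def safety_risk(episodes):
    """Classical risk: P(violation) x E[severity | violation].

    episodes: list of (violated, severity) pairs, one per run, where
    violated is a bool and severity is in [0, 1] (only meaningful
    when violated is True). Returns 0.0 with no violations.
    """
    violations = [severity for violated, severity in episodes if violated]
    if not violations:
        return 0.0
    p_violation = len(violations) / len(episodes)
    mean_severity = sum(violations) / len(violations)
    return p_violation * mean_severity
```

Separating the two factors is the point: an agent that violates constraints often but mildly and one that violates rarely but catastrophically can have the same risk score, yet call for very different mitigations.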

Reliability (R)
├── Consistency (R_Con)
│   ├── Outcome consistency (C_out)
│   ├── Trajectory consistency (C_traj)
│   └── Resource consistency (C_res)
├── Robustness (R_Rob)
│   ├── Fault robustness (R_fault)
│   ├── Environment robustness (R_env)
│   └── Prompt robustness (R_prompt)
├── Predictability (R_Pred)
│   ├── Calibration (P_cal)
│   ├── Discrimination (P_AUROC)
│   └── Brier score (P_brier)
└── Safety (R_Saf) [reported separately]
    ├── Compliance (S_comp)
    └── Harm severity (S_harm)

What the Data Shows

Evaluations of 14 agentic models across two complementary benchmarks — GAIA (general assistant tasks requiring web browsing and multi-step reasoning) and τ-bench (customer service simulations with consequential actions) — reveal several patterns that matter for practitioners.

Reliability gains lag behind capability gains. Despite 18 months of model releases showing steady accuracy improvements, overall reliability shows only modest improvement. The correlation between accuracy and reliability varies across benchmarks, confirming that capability gains do not automatically yield reliability.

Outcome consistency is low across the board. Even frontier models fail to solve the same task consistently across runs. The gap between pass@k (at least one success in k attempts) and pass∧k (success across all k attempts) is substantial for every model tested, meaning agents that can solve a task often don’t do so reliably.
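The pass@k versus pass∧k distinction is easy to compute from per-run outcomes. The sketch below uses the exact (empirical) form over k recorded runs per task, rather than the unbiased estimator sometimes used for pass@k.

```python
def pass_at_k(task_outcomes):
    """Fraction of tasks solved at least once across the k runs."""
    return sum(any(runs) for runs in task_outcomes) / len(task_outcomes)

def pass_all_k(task_outcomes):
    """Fraction of tasks solved on every one of the k runs (pass-and-k)."""
    return sum(all(runs) for runs in task_outcomes) / len(task_outcomes)

outcomes = [
    [1, 1, 1, 1, 1],  # solved reliably
    [1, 0, 1, 0, 0],  # solvable, but flaky
    [0, 0, 0, 0, 0],  # never solved
]
print(pass_at_k(outcomes))   # 2/3: "can solve"
print(pass_all_k(outcomes))  # 1/3: "solves reliably"
```

The gap between the two numbers is precisely the population of flaky tasks, which a single-run accuracy figure cannot expose.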

Agents handle infrastructure failures better than prompt variations. Models show ceiling effects on fault robustness and environment robustness: genuine technical failures are handled gracefully. Prompt robustness, however, remains a key differentiator. Sensitivity to superficial instruction paraphrasing varies substantially across models, a counterintuitive result (a rephrased instruction can hurt more than a failed tool call) and a concerning one for real-world deployment, where user phrasing varies constantly.

Practical Implications

These metrics are designed for direct adoption. Each requires only standard benchmark infrastructure plus multiple runs per task (five runs at temperature zero proved sufficient to reveal meaningful variance). The measurement protocol is straightforward: run each task multiple times for consistency, re-run under perturbations for robustness, extract confidence scores for predictability, and use LLM-based analysis for safety compliance.
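The protocol described above can be reduced to a small measurement loop. This is a structural sketch only: `agent` is a hypothetical callable returning a `(success, confidence)` pair, and `perturbations` is an assumed list of named transformation functions, not part of any real harness.

```python
def reliability_protocol(agent, tasks, k=5, perturbations=()):
    """Minimal measurement loop: k repeated runs per task for
    consistency, plus one re-run per perturbation for robustness.

    agent: callable task -> (success: bool, confidence: float)
    perturbations: iterable of (name, transform) pairs, where
    transform maps a task to its perturbed variant.
    """
    records = []
    for task in tasks:
        base = [agent(task) for _ in range(k)]  # repeated nominal runs
        perturbed = {name: agent(transform(task))  # one run per perturbation
                     for name, transform in perturbations}
        records.append({"task": task, "base": base, "perturbed": perturbed})
    return records
```

The resulting records feed every metric above: the `base` runs give consistency and (via the confidence values) predictability, while the `perturbed` entries give the robustness ratios.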

For teams evaluating agents for production deployment, the key takeaway is that accuracy is necessary but insufficient. Two agents with identical benchmark scores can have fundamentally different reliability profiles — one failing on a fixed, identifiable subset of tasks (enabling systematic debugging), the other failing unpredictably (making debugging impossible). The twelve-metric framework gives practitioners the vocabulary and tools to make this distinction before deployment, not after an incident.

Tags: research, evaluation, reliability, safety

This article is an AI-generated summary. Read the original paper: Towards a Science of AI Agent Reliability.