Agent Observability: Monitoring AI Agents in Production
How to instrument, trace, and evaluate AI agents running in production, where non-deterministic behavior and infinite input spaces make traditional APM tools insufficient.
Shipping an AI agent to production is not the end of the engineering work — it’s the beginning of a different kind. Unlike traditional software where you can enumerate failure modes and verify behavior against a finite test matrix, agents operate over an unbounded input space with probabilistic outputs, meaning you will encounter behaviors in production that never appeared during development. Observability for agents requires a fundamentally different approach than conventional APM.
Why Traditional Monitoring Falls Short
Application performance monitoring tools were designed for deterministic, structured systems. They answer questions like: did the request succeed? How long did it take? Which database query was slow? These signals matter for agents too, but they don’t tell you whether the agent actually did the right thing.
Consider the difference between a conventional API endpoint and an agent handling a customer support query. The API either returns a valid response or it doesn’t — the status code tells you most of what you need to know. The agent might return HTTP 200 with a fluent, confident answer that is factually wrong, that misinterprets the user’s intent, or that came from calling the wrong tool entirely. The system metrics look healthy while the user experience is broken.
Agents also behave differently from stateless request handlers. A single user interaction can involve multiple LLM calls, tool invocations, retrieval steps, and branching decision points. The unit of observation is not a single request — it is a trajectory: the full sequence of steps the agent took to arrive at its final output.
A passing status code means nothing for agent quality. An agent that confidently hallucinates or selects the wrong tool will appear healthy to any monitoring system that only tracks latency and error rates.
What to Capture: Traces and Trajectories
The foundational primitive for agent observability is the trace — a structured record of everything that happened during a single agent run. A useful trace includes:
- The raw user input, exactly as received
- Each intermediate step: LLM calls with full prompt and completion, tool calls with arguments and return values, retrieval queries with the documents returned
- Timing information for each step
- The final output returned to the user
- Metadata: model version, temperature, retrieved chunk IDs, tool names
For multi-turn conversations, traces need to be grouped into sessions so you can reconstruct the full context across exchanges. A single-turn trace that looks fine may reveal a problem only when you see the three prior turns that led to it.
User Message
     │
     ▼
┌─────────────────────────────────────┐
│ Agent Run (Trace)                   │
│                                     │
│  Step 1: LLM Call                   │
│   ├─ prompt snapshot                │
│   └─ completion + tool choice       │
│                                     │
│  Step 2: Tool Call → search_orders  │
│   ├─ arguments: {order_id: 12345}   │
│   └─ result: {status: "shipped"}    │
│                                     │
│  Step 3: LLM Call (final response)  │
│   ├─ prompt + tool result           │
│   └─ completion                     │
│                                     │
│  Final Output → User                │
└─────────────────────────────────────┘
     │
     ▼
Trace stored with full step log, latency, model metadata, session ID
Capturing this data requires instrumenting your agent framework at the right level of abstraction. Most frameworks — LangChain, LlamaIndex, custom ReAct loops — expose hooks or callbacks you can use to record each step without modifying your core logic.
# Example: recording a trace step in a custom agent loop
import time

def traced_tool_call(tool_fn, tool_name, arguments, trace):
    """Invoke a tool and record the call on the trace, including failures."""
    start = time.perf_counter()
    try:
        result = tool_fn(**arguments)
        trace.add_step({
            "type": "tool_call",
            "tool": tool_name,
            "arguments": arguments,
            "result": result,
            "latency_ms": (time.perf_counter() - start) * 1000,
            "error": None,
        })
        return result
    except Exception as e:
        trace.add_step({
            "type": "tool_call",
            "tool": tool_name,
            "arguments": arguments,
            "result": None,
            "latency_ms": (time.perf_counter() - start) * 1000,
            "error": str(e),
        })
        raise
Scaling Evaluation: From Manual Review to Automated Judgment
During development you can read every trace manually. In production, with thousands of interactions per day, that is not feasible. The challenge is preserving judgment quality while scaling throughput.
The practical solution most teams converge on is LLM-as-judge: using a separate LLM call to evaluate agent outputs against defined criteria. You write evaluation prompts that ask a model to score a response on dimensions like correctness, relevance, tone, or tool selection accuracy, then run those evaluators over your production trace logs on a continuous basis.
# Example: simple LLM-as-judge evaluator
def evaluate_response(user_query: str, agent_response: str, llm) -> dict:
    prompt = f"""You are evaluating a customer support agent response.

User query: {user_query}
Agent response: {agent_response}

Score the response on the following criteria (1-5):
1. Correctly understood user intent
2. Provided accurate information
3. Response was complete (no missing steps)

Return JSON: {{"intent": int, "accuracy": int, "completeness": int, "reasoning": str}}"""
    result = llm.complete(prompt)
    return parse_json(result)
LLM-as-judge has known failure modes: models can be sycophantic, inconsistent across similar inputs, or systematically biased toward longer responses. Calibrate your evaluators by sampling a few hundred traces, having a human score them, and comparing the judge’s scores to ground truth. A judge with low agreement with human labels is worse than no automated evaluation at all.
Start with a small, high-confidence set of human-labeled examples and use them to validate your LLM-as-judge before deploying it at scale. Recalibrate periodically as your agent’s behavior evolves.
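The comparison against human labels can start as a simple agreement rate. The function names and the 0.8 threshold below are illustrative assumptions, not established benchmarks:

```python
def judge_agreement(human_scores: list[int], judge_scores: list[int],
                    tolerance: int = 0) -> float:
    """Fraction of examples where the judge is within `tolerance` of the human label."""
    assert len(human_scores) == len(judge_scores)
    matches = sum(
        1 for h, j in zip(human_scores, judge_scores) if abs(h - j) <= tolerance
    )
    return matches / len(human_scores)

# Gate deployment of the evaluator on a minimum agreement rate (tune per task)
MIN_AGREEMENT = 0.8

def validate_judge(human_scores: list[int], judge_scores: list[int]) -> bool:
    return judge_agreement(human_scores, judge_scores, tolerance=1) >= MIN_AGREEMENT
```

Exact-match agreement on a 1–5 scale is often too strict; allowing a tolerance of one point, as above, is a common relaxation before reaching for heavier statistics like Cohen's kappa.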
Beyond automated judges, implement user feedback signals wherever possible. Thumbs up/down buttons, explicit corrections, or follow-up queries that indicate confusion are all weak but cheap labels. Even a 1–2% feedback rate over high-volume traffic produces a large labeled dataset over time.
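Wiring those feedback signals back to traces can be very lightweight. A sketch, assuming traces are stored in a dict keyed by trace ID (the function names and signal labels are hypothetical):

```python
def attach_feedback(trace_store: dict, trace_id: str, signal: str) -> None:
    """Record a weak feedback label ('up', 'down', 'correction') on a stored trace."""
    trace = trace_store[trace_id]
    trace.setdefault("feedback", []).append(signal)

def negative_feedback_traces(trace_store: dict) -> list[str]:
    """Trace IDs with any negative signal: candidates for review or training data."""
    return [
        tid for tid, trace in trace_store.items()
        if any(s in ("down", "correction") for s in trace.get("feedback", []))
    ]
```

The key design point is that feedback attaches to a trace ID rather than to a bare response string, so a negative signal always comes with the full trajectory that produced it.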
Detecting Drift and Regressions
Production agents degrade in ways that are hard to detect without active monitoring. Common failure patterns include:
Prompt sensitivity drift: An upstream change — a new system prompt, a model upgrade, a change to retrieved context format — alters behavior on inputs that previously worked correctly. Without a baseline distribution of scores to compare against, these regressions are invisible until users complain.
Tool failure accumulation: Tools are external dependencies. APIs change, schemas shift, rate limits tighten. Tracking tool call success rates and error distributions separately from overall agent health lets you isolate upstream breakage quickly.
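Given traces in the shape recorded earlier (steps carrying `type`, `tool`, and `error` fields), per-tool success rates fall directly out of the step logs. A sketch:

```python
from collections import defaultdict

def tool_health(traces: list[dict]) -> dict:
    """Per-tool success rate and call volume, computed from trace step logs."""
    calls = defaultdict(lambda: {"ok": 0, "err": 0})
    for trace in traces:
        for step in trace.get("steps", []):
            if step.get("type") != "tool_call":
                continue  # skip LLM calls, retrieval steps, etc.
            bucket = calls[step["tool"]]
            if step.get("error") is None:
                bucket["ok"] += 1
            else:
                bucket["err"] += 1
    return {
        tool: {
            "success_rate": c["ok"] / (c["ok"] + c["err"]),
            "calls": c["ok"] + c["err"],
        }
        for tool, c in calls.items()
    }
```

Alerting when a single tool's success rate drops while overall agent health looks normal is exactly the isolation this section describes.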
Distribution shift: The queries your users send in month three may be qualitatively different from the queries in month one, especially as your user base grows. Monitor the distribution of input topics, detected intents, or embedding clusters over time to catch shifts before they become quality problems.
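One cheap way to quantify such a shift is to compare the distribution of detected intents (or topic labels) between two time windows, for example with Jensen–Shannon divergence. A stdlib-only sketch:

```python
import math
from collections import Counter

def js_divergence(labels_a: list[str], labels_b: list[str]) -> float:
    """Jensen-Shannon divergence (in bits) between two empirical label distributions."""
    keys = set(labels_a) | set(labels_b)
    ca, cb = Counter(labels_a), Counter(labels_b)
    p = [ca[k] / len(labels_a) for k in keys]
    q = [cb[k] / len(labels_b) for k in keys]

    def kl(x, y):
        # Kullback-Leibler divergence, skipping zero-probability terms
        return sum(xi * math.log2(xi / yi) for xi, yi in zip(x, y) if xi > 0)

    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

A value of 0 means the two windows have identical label distributions; 1 bit means they are fully disjoint. Where to set the alert threshold between those extremes is an application-specific choice.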
A useful operational pattern is to maintain a shadow evaluation set derived from real production traces — a few hundred representative examples sampled and labeled once. Run this set through your evaluation pipeline on every deployment. A drop in scores relative to the previous deploy is a regression signal before any user notices.
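The deploy-time comparison can be expressed as a simple gate over the shadow set's scores; the `max_drop` tolerance below is an illustrative assumption to tune per application:

```python
def regression_gate(baseline_scores: list[float], candidate_scores: list[float],
                    max_drop: float = 0.05) -> bool:
    """Pass if the candidate's mean eval score hasn't dropped more than max_drop."""
    baseline = sum(baseline_scores) / len(baseline_scores)
    candidate = sum(candidate_scores) / len(candidate_scores)
    return candidate >= baseline - max_drop
```

A mean-score gate is the simplest version; per-category score comparisons or a statistical test catch regressions that a single average hides.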
Closing the Loop: Production Traces as Training Data
The most valuable long-term use of production observability infrastructure is feeding improvements back into the agent. Traces that received poor automated scores or negative user feedback are candidates for few-shot examples, fine-tuning datasets, or regression tests.
This creates a virtuous cycle: production behavior reveals failure modes, evaluation infrastructure surfaces and labels them, and the labeled data improves the agent. Teams that establish this loop early find that their agents improve continuously without requiring large, expensive manual curation efforts.
Treat your production trace store as a first-class engineering asset. The traces you collect in the first few months of deployment are often the highest-signal training and evaluation data you will ever have for your specific use case.
Observability is not a feature you add after the agent is working — it is a prerequisite for knowing whether the agent is working at all. The engineering investment in trace capture, evaluation pipelines, and drift detection pays back immediately in faster debugging and compounds over time as production data drives continuous improvement.
This article is an AI-generated summary. Read the original paper: Agent Observability: How to Monitor and Evaluate LLM Agents in Production.