
Automated Trace Analysis: Mining Agent Behavior Patterns at Scale

How to apply automated hierarchical clustering and LLM-driven summarization to production agent traces to surface failure modes, usage patterns, and behavioral trends without manual review.


February 25, 2026

As agent systems reach production scale, the volume of execution traces quickly outpaces any team’s ability to review them manually. A single day of traffic from a customer-facing agent might generate tens of thousands of spans, making it practically impossible to hand-inspect failure modes or spot emerging usage patterns. The solution is to treat your trace corpus as a dataset and apply LLM-driven analysis to cluster, categorize, and summarize it automatically.

Why Manual Trace Review Doesn’t Scale

In early development, reading traces one by one is exactly the right approach. You learn how your agent reasons, where it gets confused, and which tool calls are noisy. But in production, this workflow breaks down. The sheer count of traces isn’t the only problem — the diversity of interactions is. Users ask questions you never anticipated, chains fail in ways your test suite never exercised, and latency spikes cluster around specific input shapes that are invisible until you aggregate.

The instinct is to build dashboards: error rates, token counts, latency percentiles. These metrics are necessary but not sufficient. They tell you that something is wrong, not what the pattern is or which class of user inputs triggers it. To get from “error rate spiked on Tuesday” to “users asking about refunds with order numbers longer than 10 digits cause a regex failure in the extraction tool,” you need semantic understanding of trace content — which is itself an LLM task.

The Hierarchical Categorization Pattern

The most effective architecture for automated trace analysis uses a two-pass hierarchical approach. In the first pass, an LLM reads a sample of traces and proposes a set of top-level categories that cover the observed behavior space. In the second pass, each trace is assigned to a category and further broken down into subcategories. This mirrors how a human analyst would approach the problem: first get a gestalt view, then drill in.

Trace Corpus (raw spans)
         │
         ▼
┌───────────────────┐
│  Sampling Layer   │  ← configurable window, filters
└────────┬──────────┘
         │
         ▼
┌───────────────────┐
│  LLM Categorizer  │  ← Pass 1: infer top-level taxonomy
│  (taxonomy pass)  │
└────────┬──────────┘
         │  taxonomy
         ▼
┌───────────────────┐
│  LLM Classifier   │  ← Pass 2: assign each trace to category
│  (assignment pass)│
└────────┬──────────┘
         │
         ▼
┌───────────────────┐
│  Aggregator       │  ← merge metrics, feedback, attributes
└────────┬──────────┘
         │
         ▼
  Hierarchical Report
  (executive summary + category tree)

The key design decision is that the taxonomy is data-driven, not predefined. You are not asking the LLM “does this trace fall into category A, B, or C” — you are asking it to discover what categories exist. This means the system adapts to your specific application’s behavior rather than forcing traces into generic buckets like “success” and “failure.”
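As a concrete sketch, the two passes can be wired up around any chat-completion function. The `llm` parameter below is a stand-in for whatever model call you use, and the prompt shapes are illustrative, not a prescribed format:

```python
import json
from typing import Callable

def discover_taxonomy(traces: list[str], llm: Callable[[str], str],
                      focus: str) -> list[str]:
    # Pass 1: show the model a sample and ask it to propose categories.
    sample = "\n---\n".join(traces)
    prompt = (
        f"Focus question: {focus}\n"
        "Propose 3-8 short category names, as a JSON list of strings, "
        f"that cover the behavior in these traces:\n{sample}"
    )
    return json.loads(llm(prompt))

def assign_categories(traces: list[str], taxonomy: list[str],
                      llm: Callable[[str], str]) -> dict[str, str]:
    # Pass 2: closed-set classification against the discovered taxonomy.
    assignments = {}
    for trace in traces:
        prompt = (
            f"Categories: {json.dumps(taxonomy)}\n"
            f"Trace:\n{trace}\n"
            "Reply with exactly one category name."
        )
        assignments[trace] = llm(prompt).strip()
    return assignments
```

Because the taxonomy is produced in pass 1 and frozen for pass 2, every trace in the corpus is classified against the same set of labels, which is what makes the resulting counts comparable.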

Note

The quality of top-level categories depends heavily on the sampling strategy. A uniform random sample works well for general-purpose analysis, but if you suspect a rare failure mode, bias your sample toward error traces or low-feedback traces to make the signal visible to the taxonomy pass.
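One way to implement that bias, assuming each trace is a dict carrying an `error` flag, is weighted sampling without replacement using Efraimidis–Spirakis keys:

```python
import random

def biased_sample(traces: list[dict], n: int,
                  error_weight: float = 5.0, seed: int = 0) -> list[dict]:
    # Weighted sampling without replacement (Efraimidis-Spirakis):
    # each trace gets key u**(1/w); the n largest keys form the sample,
    # so error traces (weight > 1) are over-represented.
    rng = random.Random(seed)
    def key(t):
        w = error_weight if t.get("error") else 1.0
        return rng.random() ** (1.0 / w)
    ranked = sorted(traces, key=key, reverse=True)
    return ranked[:n]
```

A weight of 5–10 is usually enough to pull a 1% failure mode into visibility without letting errors dominate the taxonomy pass.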

Configuring the Analysis Job

A good trace analysis system exposes several configuration levers:

Trace selection — which project, time window, and filter criteria to analyze. You might want only traces that returned user feedback below a threshold, or only traces that hit a specific tool.

Instructions / focus question — a natural-language prompt that steers the LLM toward what you care about. “What are the main topics users are asking about?” produces a topic taxonomy. “What failure modes appear most frequently?” produces an error taxonomy. The same corpus, analyzed with different instructions, yields entirely different category structures.

Attribute extraction — structured fields to pull from each trace beyond its category assignment. These might include: the tool that was last called before an error, the number of LLM hops, whether the user expressed frustration in their final message. Extracted attributes let you correlate category membership with operational metrics.

Scheduling — running analysis on a recurring cadence (daily, weekly) lets you track how the category distribution shifts over time as your agent evolves or your user base changes.
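Pulled together, the four levers fit naturally into a single job specification. The field names below are illustrative, not any particular platform’s API:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TraceAnalysisJob:
    project: str
    start: str                                   # ISO date, inclusive
    end: str                                     # ISO date, exclusive
    filters: dict = field(default_factory=dict)  # e.g. {"feedback_lt": 0.5}
    instructions: str = "What are the main topics users are asking about?"
    extract_attributes: tuple = ()               # e.g. ("last_tool_before_error",)
    cadence: Optional[str] = None                # e.g. "weekly"; None = one-off

job = TraceAnalysisJob(
    project="support-agent",
    start="2024-03-01",
    end="2024-03-08",
    filters={"feedback_lt": 0.5},
    instructions="What failure modes appear most frequently?",
)
```

Keeping the job as data rather than ad-hoc script arguments makes scheduled runs reproducible and lets you diff configurations between weeks.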

Here is a minimal example of calling such an analysis programmatically against a corpus of chat histories:

from langsmith import Client
import os

client = Client()

# Load production chat histories from your own data store
chat_histories = load_chat_histories_from_db(start="2024-03-01", end="2024-03-07")

report = client.generate_insights(
    chat_histories=chat_histories,
    name="Support Topics - Week of March 1",
    instructions="What are the main topics and questions users are asking about? "
                 "Flag any interactions where the agent was unable to help.",
    openai_api_key=os.environ["OPENAI_API_KEY"],
)

# Poll until complete, then inspect the returned report URL
client.poll_insights(report=report)
print(report)

Note that this pattern works on data that was never originally traced through your observability platform — you can feed raw conversation logs from any source.
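A small adapter is usually all that is needed. Here the input is assumed to be (speaker, text) pairs from an arbitrary log store, mapped onto OpenAI-style message dicts:

```python
def to_chat_history(raw_log: list[tuple[str, str]]) -> list[dict]:
    # Map source-specific speaker labels onto standard chat roles;
    # unknown labels pass through unchanged.
    role_map = {"customer": "user", "bot": "assistant", "agent": "assistant"}
    return [
        {"role": role_map.get(speaker, speaker), "content": text}
        for speaker, text in raw_log
    ]
```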

Cost and Sampling Tradeoffs

Automated trace analysis is itself an LLM workload, and its cost scales with the number of traces sampled and the average length of each trace. A corpus of 1,000 threads typically costs $1–4 depending on the model family chosen. This is economical for weekly batch analysis, but it means you need to think carefully about sampling if you want to run analysis continuously or over very large corpora.
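A back-of-envelope cost model makes the scaling explicit. The per-token price and the 20% prompt-overhead factor below are illustrative assumptions, not quoted rates:

```python
def estimate_analysis_cost(n_traces: int, avg_tokens_per_trace: int,
                           usd_per_1m_tokens: float = 2.50,
                           prompt_overhead: float = 1.2) -> float:
    # Input-token cost only; overhead covers instructions and formatting
    # added around each trace in the analysis prompts.
    total_tokens = n_traces * avg_tokens_per_trace * prompt_overhead
    return total_tokens / 1_000_000 * usd_per_1m_tokens
```

At 1,000 traces averaging 1,200 tokens each, this lands around $3.60 under the assumed price, consistent with the $1–4 range above.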

Tip

For high-volume systems, stratified sampling — ensuring representation across error rates, latency buckets, and user cohorts — gives you better category coverage than uniform random sampling at a fraction of the trace count. Aim for 500–2,000 traces per analysis run; beyond that, the taxonomy rarely changes but the cost grows linearly.
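A minimal sketch of that stratification, assuming each trace dict carries the fields you stratify on:

```python
from collections import defaultdict
import random

def stratified_sample(traces: list[dict], key, per_stratum: int,
                      seed: int = 0) -> list[dict]:
    # Group traces by stratum, then draw up to per_stratum from each,
    # so small cohorts (rare errors, slow requests) are never drowned out.
    rng = random.Random(seed)
    strata = defaultdict(list)
    for t in traces:
        strata[key(t)].append(t)
    sample = []
    for bucket in sorted(strata):
        members = strata[bucket]
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample
```

The `key` function can combine several attributes (say, latency bucket plus error flag) to define finer cohorts when a single dimension is too coarse.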

Choosing a smaller, faster model for the assignment pass (where each trace is simply classified against an already-established taxonomy) and reserving a more capable model for the taxonomy discovery pass is a reasonable cost-optimization strategy. The taxonomy pass needs to generalize from examples; the assignment pass is closer to a classification task.
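Under illustrative prices (a capable model at $10 per 1M input tokens for discovery, a small model at $0.50 for assignment), the savings from the split are easy to quantify:

```python
def split_pass_cost(n_traces: int, avg_tokens: int,
                    taxonomy_sample: int = 200,
                    big_usd_per_1m: float = 10.00,
                    small_usd_per_1m: float = 0.50) -> float:
    # Pass 1 runs on a small sample with the expensive model;
    # pass 2 classifies every trace with the cheap one.
    taxonomy_cost = taxonomy_sample * avg_tokens / 1e6 * big_usd_per_1m
    assignment_cost = n_traces * avg_tokens / 1e6 * small_usd_per_1m
    return taxonomy_cost + assignment_cost
```

Analyzing 2,000 traces of roughly 1,000 tokens each comes to about $3 total under these assumed prices, versus roughly $22 if the large model ran both passes.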

Integrating Insights Into Your Development Loop

The output of automated trace analysis is most valuable when it feeds back into the development cycle rather than sitting as a standalone report. Concretely:

  • Failure categories become eval cases. When the analysis surfaces a subcategory like “users asking about international shipping receive incorrect carrier names,” that description is precise enough to generate a targeted test set for your evaluation harness.
  • Usage distribution informs fine-tuning data. If 40% of your traffic is in a category your agent handles poorly, that category should be overrepresented in any fine-tuning dataset you construct.
  • Category shift over time is a regression signal. A rising share of “agent asks for clarification” traces might indicate a prompt regression or a change in user behavior worth investigating.
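Tracking the third signal requires nothing more than comparing category counts between consecutive runs; a minimal sketch:

```python
def shifted_categories(prev: dict[str, int], curr: dict[str, int],
                       threshold: float = 0.05) -> list[str]:
    # Flag categories whose share of total traffic moved by more than
    # `threshold` (absolute, i.e. 5 percentage points) between runs.
    prev_total, curr_total = sum(prev.values()), sum(curr.values())
    flagged = []
    for cat in sorted(set(prev) | set(curr)):
        delta = (curr.get(cat, 0) / curr_total
                 - prev.get(cat, 0) / prev_total)
        if abs(delta) > threshold:
            flagged.append(cat)
    return flagged
```

Comparing shares rather than raw counts keeps the signal meaningful even when overall traffic volume changes between runs.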

Automated trace analysis transforms your production logs from an audit trail into a continuous feedback mechanism — a form of unsupervised evaluation that runs alongside your human-labeled benchmarks and complements them with real-world signal.

Tags: observability, tracing, evaluation, LLM analysis, production monitoring

This article is an AI-generated summary. Read the original: Discover errors and usage patterns with the Insights Agent.