Structured Trace Analysis for Agent Debugging: The SIR Pattern
How to apply a Summarize–Identify–Report pipeline with specialized sub-agents to compress, diagnose, and act on agentic execution traces at scale.
Agentic execution traces are simultaneously the richest debugging artifact you have and one of the hardest things to reason about systematically. A single run of a moderately complex agent can produce thousands of tokens spanning tool calls, intermediate reasoning, retries, and branching decisions — far too much for a human reviewer to triage at scale, and often too noisy for a single LLM pass to handle reliably. A structured pipeline that delegates different analytical jobs to specialized sub-agents turns this problem into something tractable.
Why Raw Traces Are Hard to Analyze
When an agent fails, the naive approach is to dump the full execution trace into a prompt and ask a model to explain what went wrong. This works surprisingly well for short, simple runs. It breaks down quickly in production.
The core problem is information density mismatch. A trace contains at least three qualitatively different kinds of content: high-signal behavioral events (a tool returning an unexpected error, a planner changing its goal mid-run), low-signal bookkeeping (parameter serialization, HTTP headers, timestamps), and redundant context (repeated system prompt injections, duplicate retrieval results). A single LLM reviewer treats all three categories with roughly equal attention, which wastes context budget and degrades analytical precision.
A secondary problem is that failure diagnosis and report generation are different cognitive tasks. Spotting that a retry loop was triggered because of a malformed JSON response requires pattern-matching over a sequence of events. Producing a structured incident report that an engineer can act on requires synthesis, prioritization, and a different output schema. Collapsing both into one pass means neither gets done well.
Passing a full raw trace to a single LLM reviewer is not a reliable debugging strategy for production agents. Context limits aside, the model cannot simultaneously compress noise, identify failure modes, and produce actionable reports at equal quality.
The SIR Pipeline: Summarize, Identify, Report
The SIR pattern structures trace analysis as a three-stage pipeline, each stage handled by an agent with a narrow, well-defined responsibility.
Summarize. The first agent’s job is compression without behavioral loss. It reads the raw trace and produces a structured intermediate representation — a TraceFormat — that strips bookkeeping noise while preserving every event that has causal relevance to the outcome. Think of this as producing a lossily compressed replay: a human or downstream agent reading it should be able to reconstruct the agent’s decision path without wading through serialization artifacts. Concretely, the summarizer extracts tool call sequences with inputs and outputs, reasoning steps that changed the agent’s plan, error signals and retry triggers, and goal/subgoal transitions.
Identify. The second agent operates on the TraceFormat, not the raw trace. Its job is failure detection and root-cause classification. Because it receives structured, noise-reduced input, it can apply pattern matching more reliably: did the agent enter a retry loop? Did it hallucinate a tool parameter that caused a downstream cascade? Did it abandon a valid subgoal prematurely? The identifier outputs a structured failure taxonomy entry — a label plus the specific trace evidence that supports it.
Report. The third agent synthesizes the identifier’s output into a human-readable, actionable incident report. This is where the analysis becomes useful to an engineer: what failed, why, what the observable symptoms were, and what remediation options exist. Separating this from identification means the report agent can be tuned for format, verbosity, and audience without affecting analytical accuracy upstream.
Raw Execution Trace
         │
         ▼
┌───────────────────┐
│ Summarizer Agent  │  → strips noise, extracts causal events
└───────────────────┘
         │  TraceFormat (structured intermediate)
         ▼
┌───────────────────┐
│ Identifier Agent  │  → detects failure patterns, classifies root causes
└───────────────────┘
         │  Failure taxonomy + evidence
         ▼
┌───────────────────┐
│ Report Agent      │  → produces actionable incident report
└───────────────────┘
         │
         ▼
Structured Diagnostic Output
Designing the TraceFormat Abstraction
The TraceFormat intermediate representation is the linchpin of the pipeline. Get it wrong and you either lose diagnostic signal (over-compression) or push the noise problem downstream to the identifier (under-compression).
A practical TraceFormat schema for most agent architectures should capture:
{
  "run_id": "string",
  "goal": "string",
  "steps": [
    {
      "step_id": 1,
      "type": "tool_call | reasoning | goal_update | error",
      "tool": "string (if tool_call)",
      "input_summary": "string",
      "output_summary": "string",
      "outcome": "success | partial | failure",
      "triggered_retry": false,
      "causal_notes": "string (optional)"
    }
  ],
  "terminal_state": "success | failure | timeout | abandoned",
  "failure_signals": ["list of anomalous events"]
}
The input_summary and output_summary fields are where the summarizer agent does real work: it must produce a lossless-for-debugging condensed version of potentially large tool inputs and outputs. For RAG steps this might mean preserving retrieval scores and top-k document snippets rather than full retrieved text. For code execution steps it might mean preserving stdout, stderr, and exit code but not the full execution environment dump.
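As a minimal sketch of what this condensation might look like, here is a hypothetical per-tool summarizer helper. The tool names (`code_exec`, `retrieval`) and payload fields are assumptions for illustration, not part of any fixed SIR spec; real rules would be tuned to your own tool set:

```python
def summarize_output(tool: str, payload: dict, max_chars: int = 200) -> str:
    """Condense a raw tool result into an output_summary for the TraceFormat.

    Illustrative assumptions: code execution keeps exit code plus the tail of
    stderr (where tracebacks end up); retrieval keeps scores and snippet heads;
    anything else is truncated uniformly.
    """
    if tool == "code_exec":
        # Preserve exit status and the end of stderr, drop the environment dump.
        tail = payload.get("stderr", "")[-max_chars:]
        return f"exit={payload.get('exit_code')} stderr_tail={tail!r}"
    if tool == "retrieval":
        # Preserve retrieval scores and short snippets, not full document text.
        hits = payload.get("hits", [])[:3]
        return "; ".join(f"score={h['score']:.2f} {h['text'][:60]}" for h in hits)
    return str(payload)[:max_chars]
```

The key design property is that each branch keeps exactly the fields a debugger would ask for, so the identifier never needs to re-read the raw trace.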
Treat the TraceFormat as a versioned schema. As your agent architecture evolves — new tools, new reasoning patterns — your summarizer prompt and format spec will need to evolve with it. Version-pin your TraceFormat alongside your agent version so that historical traces remain analyzable.
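One minimal way to enforce that pinning, assuming a semantic-version string stored in a hypothetical `schema_version` field of each emitted TraceFormat:

```python
# Hypothetical version pin: bump the major version on any breaking
# TraceFormat change, and stamp it into every TraceFormat the summarizer emits.
TRACEFORMAT_VERSION = "1.2.0"

def check_schema_version(trace_format: dict) -> None:
    """Reject traces whose TraceFormat major version doesn't match the analyzer's."""
    seen = trace_format.get("schema_version", "0.0.0")
    if seen.split(".")[0] != TRACEFORMAT_VERSION.split(".")[0]:
        raise ValueError(
            f"TraceFormat {seen} is incompatible with analyzer {TRACEFORMAT_VERSION}"
        )
```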
Implementing the Pipeline in Practice
The three-agent pipeline maps naturally onto any multi-agent orchestration framework. The orchestrator fans out the summarizer across traces (this is embarrassingly parallel across a batch of runs), runs the identifier on each resulting TraceFormat, then generates reports.
import asyncio

async def analyze_trace_batch(traces: list[RawTrace]) -> list[IncidentReport]:
    # Stage 1: parallel summarization
    trace_formats = await asyncio.gather(
        *[summarizer_agent.run(trace) for trace in traces]
    )
    # Stage 2: parallel failure identification
    identifications = await asyncio.gather(
        *[identifier_agent.run(tf) for tf in trace_formats]
    )
    # Stage 3: report generation (can also be parallelized)
    reports = await asyncio.gather(
        *[report_agent.run(ident) for ident in identifications]
    )
    return reports
For production use, add a routing step before the identifier: if the summarizer’s terminal_state is success and failure_signals is empty, skip deep identification and emit a lightweight success record instead. This keeps costs proportional to the actual failure rate in your trace population.
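A sketch of that routing guard, assuming the TraceFormat dicts carry the `terminal_state`, `failure_signals`, and `run_id` fields from the schema above, and that `identifier_agent` is the same agent handle used in the pipeline code:

```python
def needs_identification(tf: dict) -> bool:
    """Route only suspicious traces into the (expensive) identifier stage."""
    return tf["terminal_state"] != "success" or bool(tf["failure_signals"])

async def identify_with_routing(trace_formats: list[dict]) -> list[dict]:
    results = []
    for tf in trace_formats:
        if needs_identification(tf):
            # Deep diagnosis via the identifier agent from the pipeline above.
            results.append(await identifier_agent.run(tf))
        else:
            # Lightweight success record: no identifier call, no report.
            results.append({"run_id": tf["run_id"], "status": "ok"})
    return results
```

With this guard in place, identifier and report costs scale with the failure count rather than the total trace count.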
It’s also worth running the identifier agent with structured output mode (JSON schema enforcement) if your LLM provider supports it. The failure taxonomy labels need to be consistent across runs to be aggregable — free-form natural language descriptions will drift.
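For example, a hypothetical JSON schema for the identifier's output might constrain the label to a fixed enumeration (the label names and field names here are illustrative, not a standard taxonomy):

```json
{
  "type": "object",
  "properties": {
    "failure_label": {
      "type": "string",
      "enum": ["retry_loop", "hallucinated_parameter", "premature_abandonment",
               "tool_error_cascade", "goal_drift", "other"]
    },
    "evidence_step_ids": {"type": "array", "items": {"type": "integer"}},
    "confidence": {"type": "number", "minimum": 0, "maximum": 1}
  },
  "required": ["failure_label", "evidence_step_ids"]
}
```

The enum is what makes downstream aggregation possible: two runs tagged `retry_loop` are countable as the same failure mode, which free-form descriptions never guarantee.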
Connecting Trace Analysis to Agent Improvement
The SIR pipeline produces value at two timescales. In the short run, incident reports help on-call engineers diagnose individual agent failures faster than manual trace review. In the medium run, the structured failure taxonomy accumulates into a dataset that reveals systemic issues: which tool is most frequently implicated in cascading failures, which goal types reliably trigger retry loops, which reasoning patterns precede task abandonment.
This aggregate signal is where the real leverage is. A failure taxonomy built from hundreds of runs can directly inform prompt revisions (add explicit handling for the error pattern the agent keeps misclassifying), tool interface redesigns (the tool whose malformed outputs trigger the most retries), and fine-tuning datasets (traces where the identifier flagged a reasoning failure are candidates for behavioral cloning from corrected traces).
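With enumerated labels in place, the aggregation itself is nearly trivial. A sketch, assuming each identification dict carries the hypothetical `failure_label` field discussed above:

```python
from collections import Counter

def top_failure_modes(identifications: list[dict], n: int = 5) -> list[tuple[str, int]]:
    """Count identifier labels across a batch of runs and return the n most common."""
    return Counter(i["failure_label"] for i in identifications).most_common(n)
```

Run weekly over production traces, a table like this is often enough to decide which tool interface or prompt section to fix first.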
Structured trace analysis closes the feedback loop between production agent behavior and agent development. Treating the identifier’s failure taxonomy as a first-class engineering artifact — tracked in version control, reviewed in retrospectives — is how teams move from reactive debugging to systematic agent improvement.
The SIR pattern is not magic: it depends on the quality of each sub-agent’s prompts, and those prompts need to be calibrated to your specific agent architecture and tool set. But the structural separation it imposes — compress first, analyze second, report third — is sound engineering regardless of the underlying models, and it scales to trace volumes that make manual review impractical.
This article is an AI-generated summary. Read the original paper: TraceSIR: A Multi-Agent Framework for Structured Analysis and Reporting of Agentic Execution Traces.