The Planner–Observer Pattern for Multi-Agent Research Systems
How to design multi-agent research systems using a planner that generates dynamic parallel tasks and an observer that maintains global context across all agents.
Deep research agents—systems that autonomously explore information sources and synthesize findings—expose a fundamental tension in multi-agent design: individual agents need focused, narrow context to reason well, while the system as a whole must maintain coherent global state. The planner–observer pattern resolves this tension by separating the roles of query decomposition, task execution, and cross-task synthesis into distinct agents with deliberately scoped visibility.
Core Roles and Responsibilities
The pattern has three primary components, each with a clear charter.
The planner receives the original user request and produces a structured execution plan. Its job is decomposition: analyzing the query, estimating complexity, and emitting a list of typed research tasks. Critically, the planner is the only component that sees the raw user intent in its entirety. After it emits the task list it is done—it does not participate in feedback loops or task monitoring.
Each task agent receives a narrow specification: a single research objective, a required output schema, and access to a specific set of tools. Task agents are stateless with respect to each other. They do not share a message queue, do not receive intermediate outputs from sibling tasks, and do not know how many other tasks are running. This isolation is intentional: it prevents one agent’s intermediate reasoning from contaminating another’s search strategy and keeps token budgets predictable.
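A minimal sketch of what such a narrow specification might look like as a data structure. The field names are illustrative (chosen to mirror the plan JSON shown later), not taken from any particular implementation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskSpec:
    """Narrow, self-contained specification handed to one task agent.

    The agent sees only these fields: no sibling outputs, no global plan,
    no knowledge of how many other tasks are running.
    """
    task_id: str
    objective: str          # one research question, stated in full
    output_schema: str      # name of the typed result the agent must emit
    tools: tuple[str, ...]  # whitelist of tools this agent may call
    priority: int = 1

spec = TaskSpec(
    task_id="task-1",
    objective="Find recent regulatory changes in EU AI Act enforcement",
    output_schema="RegulationSummary",
    tools=("web_search", "document_fetch"),
)
```

Freezing the dataclass reinforces the isolation guarantee: nothing downstream can mutate a task's charter after dispatch.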
The observer is the integration layer. It holds the full execution graph—the original plan, all task outputs, intermediate reasoning traces, and citation metadata. The observer is responsible for deciding when the research is complete, assembling final structured output, and optionally triggering re-planning if critical gaps remain.
The key asymmetry in this pattern is intentional: the observer sees everything, while a task agent sees at most the cleaned output of a previously completed task—never the raw reasoning chain behind it. This prevents context bloat while preserving cross-task coherence at the synthesis stage.
Dynamic Task Generation vs. Fixed Workflows
A naive research agent uses a fixed pipeline: search, read, summarize, repeat. The planner–observer pattern replaces this with runtime task generation. When the planner analyzes a query, it decides how many tasks to create and what type each one should be. A simple factual lookup might generate a single task; a comparative market analysis might spawn eight parallel tasks with different tool configurations.
This dynamic scaling has meaningful engineering consequences. You cannot pre-allocate worker capacity or rely on deterministic execution graphs. Each plan invocation is a runtime decision. In practice this means the planner’s output schema needs to be stable even if its content varies wildly—a JSON array of task descriptors where every task has the same required fields regardless of its type.
{
  "tasks": [
    {
      "id": "task-1",
      "objective": "Find recent regulatory changes in EU AI Act enforcement",
      "output_schema": "RegulationSummary",
      "tools": ["web_search", "document_fetch"],
      "priority": 1
    },
    {
      "id": "task-2",
      "objective": "Identify key compliance vendors operating in this space",
      "output_schema": "VendorList",
      "tools": ["web_search"],
      "priority": 1
    }
  ]
}
The planner prompt must be engineered carefully. It should receive examples of both simple single-task plans and complex multi-task plans so the model learns to calibrate task count to actual query complexity rather than defaulting to a fixed number.
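One way to enforce the stable-schema requirement is to validate every emitted plan against the same set of required fields before dispatch, regardless of how many tasks it contains. A sketch, with field names mirroring the JSON example above:

```python
REQUIRED_TASK_FIELDS = {"id", "objective", "output_schema", "tools", "priority"}

def validate_plan(plan: dict) -> list[dict]:
    """Return the task list if every task has the required fields; raise otherwise."""
    tasks = plan.get("tasks")
    if not isinstance(tasks, list) or not tasks:
        raise ValueError("plan must contain a non-empty 'tasks' array")
    for task in tasks:
        missing = REQUIRED_TASK_FIELDS - task.keys()
        if missing:
            raise ValueError(
                f"task {task.get('id', '?')} missing fields: {sorted(missing)}"
            )
    return tasks

# A single-task plan and an eight-task plan pass through the same check:
tasks = validate_plan({"tasks": [{
    "id": "task-1",
    "objective": "Find recent regulatory changes in EU AI Act enforcement",
    "output_schema": "RegulationSummary",
    "tools": ["web_search", "document_fetch"],
    "priority": 1,
}]})
```

Because the check is shape-only, the planner stays free to vary content wildly while the dispatcher's contract stays fixed.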
Context Engineering for Task Agents
Because task agents are isolated, what you include in each agent’s context window is a significant design decision. Two strategies sit at opposite ends of the tradeoff spectrum.
Full-content retrieval gives the agent access to complete documents. This maximizes reasoning quality but increases token usage substantially, especially across many parallel tasks.
Snippet-first reasoning provides only short excerpts initially. The agent reasons over snippets and only requests full document content when its intermediate conclusion is uncertain or when the snippet lacks the specific detail needed. This reduces average token consumption dramatically while preserving answer quality for the majority of tasks where snippets are sufficient.
Implement snippet-first as the default tool behavior and expose full-content retrieval as a separate tool the agent must explicitly call. This makes the cost tradeoff legible in your observability tooling—you can measure how often agents escalate from snippets to full content.
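A sketch of how the two retrieval tools might be exposed, with a counter that makes the escalation rate directly measurable. The tool names and the stub backends are assumptions for illustration, not from the original system:

```python
escalations = {"snippet": 0, "full": 0}

def fetch_snippets(query: str) -> list[str]:
    # stand-in for a real search backend
    return [f"snippet about {query}"]

def fetch_document(url: str) -> str:
    # stand-in for a real document fetcher
    return f"full text of {url}"

def search_snippets(query: str) -> list[str]:
    """Default tool: returns short excerpts only."""
    escalations["snippet"] += 1
    return fetch_snippets(query)

def get_full_content(url: str) -> str:
    """Separate tool the agent must explicitly call to escalate."""
    escalations["full"] += 1
    return fetch_document(url)

def escalation_rate() -> float:
    """Fraction of retrieval calls that escalated to full content."""
    total = escalations["snippet"] + escalations["full"]
    return escalations["full"] / total if total else 0.0
```

Because escalation is a distinct tool call rather than a parameter on one tool, it shows up as its own span in traces, which is what makes the cost tradeoff legible.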
User Query
    │
    ▼
┌─────────┐   Task Specs (JSON)
│ Planner │ ──────────────────────┐
└─────────┘                       │
                                  ▼
                  ┌─────────────────────────┐
                  │     Task Dispatcher     │
                  └──┬────────┬────────┬────┘
                     │        │        │
              ┌──────┘   ┌────┘   ┌────┘
              ▼          ▼        ▼
           ┌──────┐   ┌──────┐   ┌──────┐
           │Task 1│   │Task 2│   │Task N│
           │Agent │   │Agent │   │Agent │
           └──┬───┘   └──┬───┘   └──┬───┘
              │ cleaned  │ cleaned  │ cleaned
              │ output   │ output   │ output
              └──────────┴────┬─────┘
                              ▼
                     ┌──────────────┐
                     │   Observer   │◄─ Full trace,
                     │              │   citations,
                     │              │   all plans
                     └──────┬───────┘
                            │
                            ▼
                Structured Final Output
Structured Output as a First-Class Constraint
Research systems built for human readers can tolerate unstructured prose. Systems built for API consumers cannot. When downstream code must parse and act on research results, reliability of output format matters as much as accuracy of content.
Enforcing structure at every layer—not just at final output—has compounding benefits. Task agents that emit typed JSON are easier to compose: the observer can merge their outputs programmatically rather than asking another LLM to reconcile free-text responses. Validation errors surface immediately, making debugging faster. And downstream consumers can schema-validate responses before processing, failing fast rather than propagating malformed data.
Function calling (tool use with a fixed return schema) is the practical mechanism for enforcing this. Each task agent is given a single “submit_result” function whose arguments match the task’s declared output schema. The agent cannot return a response except by calling that function, which guarantees the output is structurally valid before the observer ever receives it.
submit_result_tool = {
    "name": "submit_result",
    "description": "Submit the completed research result for this task.",
    "parameters": task_output_schema  # injected per-task at runtime
}
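A sketch of the harness side of this contract: the agent's only exit path is a `submit_result` call whose arguments satisfy the task's declared schema. The required-field check here is a minimal stand-in for a real JSON Schema validator, and the schema itself is illustrative:

```python
# Illustrative per-task schema; a real one would be the task's declared output type
task_output_schema = {
    "type": "object",
    "required": ["summary", "citations"],
}

def accept_tool_call(name: str, arguments: dict) -> dict:
    """Reject anything other than a structurally valid submit_result call."""
    if name != "submit_result":
        raise ValueError(f"unexpected tool call: {name}")
    missing = set(task_output_schema["required"]) - arguments.keys()
    if missing:
        raise ValueError(f"submit_result missing fields: {sorted(missing)}")
    return arguments  # only now is the result forwarded to the observer

result = accept_tool_call(
    "submit_result",
    {"summary": "...", "citations": ["https://example.com"]},
)
```

The observer therefore never has to defend against free-text output; structurally invalid results are rejected at the boundary.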
Observability Requirements for This Pattern
The planner–observer pattern runs many agents concurrently, each with its own tool calls, retrieval decisions, and token budgets. Without structured observability, production debugging becomes guesswork.
At minimum, instrument the following signals per execution:
- Token usage per task agent, broken down by input, output, and cached tokens
- Tool call counts and latency per task, to identify which tasks are expensive
- Snippet-to-full-content escalation rate, as a proxy for retrieval difficulty
- Observer synthesis latency, which often reveals whether the bottleneck is task execution or final assembly
- Plan shape (number of tasks generated) correlated with query complexity, so you can validate the planner’s calibration over time
Token cost in a dynamic multi-agent system is non-deterministic by design—two identical queries can generate different task counts and different retrieval depths. Build your pricing and cost controls around percentile budgets, not fixed per-query estimates.
Tracking these signals at the span level—where each agent invocation, each tool call, and each observer step is a distinct trace node—gives you the granularity needed to identify which component is responsible for latency spikes or cost outliers in production.
This article is an AI-generated summary. Read the original post: How Exa built a Web Research Multi-Agent System with LangGraph and LangSmith.