Deep Research Agent Architecture: Planner, Tasks, and Observer
How to architect a production deep-research multi-agent system using a planner, parallel task workers, and a context-aware observer—with structured output and progressive content retrieval.
Deep research agents—systems that autonomously decompose a question, fan out to parallel investigations, and synthesize structured results—represent one of the most demanding multi-agent patterns in production use today. Getting the architecture right requires deliberate choices about how context flows between components, how work is parallelized, and how outputs are formatted for downstream consumers. This article walks through the core engineering decisions behind a production-grade deep research system.
The Three-Role Architecture
A deep research agent is naturally decomposed into three cooperating roles: a Planner, one or more Task workers, and an Observer.
The Planner receives the raw research query and is responsible for a single decision: what independent units of work need to happen, and what does each unit need to know? It emits a dynamic task list—not a static, hardcoded graph—so the system can scale from a one-step lookup to a dozen parallel investigations depending on query complexity.
Each Task worker is an isolated agent. It receives only three things: specific instructions scoped to its subtask, a required JSON output schema, and access to a set of retrieval tools. Task workers do not share state with one another during execution. This isolation keeps their reasoning clean and makes failures easy to attribute.
The Observer holds the privileged position: it maintains full context across every planning decision, intermediate reasoning step, and final output. It is the synthesis layer. Crucially, it is the only component that sees the full picture.
```
┌─────────────────────────────────────────────────────────┐
│                         PLANNER                         │
│      Analyzes query → emits N task specs (dynamic)      │
└────────────┬──────────────────────────────┬─────────────┘
             │ task spec 1      task spec N │
             ▼                              ▼
    ┌─────────────────┐          ┌─────────────────┐
    │   TASK WORKER   │  . . .   │   TASK WORKER   │
    │ tools + schema  │          │ tools + schema  │
    │ → JSON output   │          │ → JSON output   │
    └────────┬────────┘          └────────┬────────┘
             │    cleaned outputs only    │
             └─────────────┬──────────────┘
                           ▼
              ┌─────────────────────────┐
              │        OBSERVER         │
              │  full context: plans,   │
              │  reasoning, citations,  │
              │  all task outputs       │
              │  → final structured     │
              │    response             │
              └─────────────────────────┘
```
Intentional Context Partitioning
The most important architectural decision in a system like this is not what data to collect—it is what data to withhold from each component.
Task workers receive only the cleaned final outputs of other tasks, never intermediate reasoning chains. This is intentional context engineering: feeding a task worker another worker’s chain-of-thought would bloat its context window, introduce noise, and risk cross-contaminating reasoning. What a task needs is a peer’s conclusion, not its deliberation.
The Observer, by contrast, needs the full picture to synthesize coherently. It should see planning rationale, tool call results, citations, and every task’s output. This asymmetry—restricted context for workers, full context for the observer—is the key insight that makes the pattern scale.
Context partitioning is not just a cost optimization. It is a correctness technique. Task workers that see too much context tend to anchor on prior reasoning rather than independently evaluating evidence, which degrades research quality.
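As a concrete sketch of this asymmetry, the two context-building paths might look like the following. The dict keys (`conclusion`, `reasoning_trace`, `citations`) are illustrative names, not part of any specific framework:

```python
from typing import List


def worker_context(peer_outputs: List[dict]) -> str:
    """Workers see only peers' cleaned conclusions, never their reasoning."""
    return "\n\n".join(
        f"[{o['task_id']}] {o['conclusion']}" for o in peer_outputs
    )


def observer_context(plan: dict, task_runs: List[dict]) -> str:
    """The observer sees everything: plan rationale, reasoning, citations."""
    parts = [f"Planning rationale: {plan['reasoning']}"]
    for run in task_runs:
        parts.append(
            f"[{run['task_id']}]\n"
            f"reasoning: {run['reasoning_trace']}\n"
            f"citations: {run['citations']}\n"
            f"output: {run['output']}"
        )
    return "\n\n".join(parts)
```

The point of keeping these as two distinct functions is that the restriction is enforced structurally: a worker's prompt is physically incapable of containing a peer's reasoning trace.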
Dynamic Task Generation
Rigid, hardcoded workflows are the enemy of a general-purpose research agent. A query like “What is the boiling point of water?” needs one task. A query like “Compare the regulatory landscape for autonomous vehicles across the EU, US, and China” needs many, and the right decomposition cannot be known ahead of time.
The Planner should therefore be a generative step, not a routing switch. In practice this means prompting the Planner to emit a structured list of task specifications—each with a goal, required tools, and output schema—and then instantiating workers dynamically from that list.
```python
from typing import List

from pydantic import BaseModel


class TaskSpec(BaseModel):
    task_id: str
    goal: str
    tools: List[str]
    output_schema: dict  # JSON Schema object


class PlannerOutput(BaseModel):
    tasks: List[TaskSpec]
    reasoning: str


# The planner node produces a PlannerOutput.
# The orchestrator spawns one worker per TaskSpec.
def spawn_workers(plan: PlannerOutput) -> List[dict]:
    return [
        {
            "task_id": spec.task_id,
            "instructions": spec.goal,
            "output_schema": spec.output_schema,
            "available_tools": spec.tools,
        }
        for spec in plan.tasks
    ]
```
This pattern allows a single orchestrator implementation to handle variable-width parallelism without modification.
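The fan-out itself can be a small concurrency utility. This sketch assumes async workers and uses a semaphore to cap parallel width; the function names and the default cap are illustrative, not from any particular orchestrator:

```python
import asyncio
from typing import Awaitable, Callable, List


async def fan_out(
    configs: List[dict],
    run_worker: Callable[[dict], Awaitable[dict]],
    max_concurrency: int = 8,
) -> List[dict]:
    """Run one worker per config, bounding parallelism with a semaphore."""
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(cfg: dict) -> dict:
        async with sem:
            return await run_worker(cfg)

    # gather preserves input order, so results align with configs.
    return await asyncio.gather(*(bounded(c) for c in configs))
```

The concurrency cap matters in practice: a twelve-task plan fired without bounds can trip provider rate limits, and a semaphore is the cheapest place to absorb that.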
Progressive Content Retrieval
Web research agents face a perennial tension: fetching full page content maximizes information but burns tokens fast; relying only on snippets is cheap but may miss critical detail.
The right design treats content retrieval as a two-stage decision. The task worker first attempts to answer its sub-question using search snippets alone. Only when snippet-level context is insufficient—when the model cannot satisfy the output schema with what it has—does it trigger a full-page fetch.
Implement this as a tool-routing decision inside the task worker rather than a preprocessing step. Give the worker a search_snippets tool and a separate fetch_full_page tool. The model will naturally reach for the cheaper tool first when the prompt makes the cost tradeoff explicit.
This approach is particularly effective because many research queries can be answered from snippet metadata alone—publication date, domain, headline, and the first few sentences cover a surprising fraction of lookup tasks. Full crawls are reserved for cases where depth genuinely matters.
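A sketch of the two tool definitions in OpenAI-style function-calling format. The tool names match the text above, but the parameter schemas and descriptions are illustrative; the key idea is that the cost tradeoff lives in the tool descriptions themselves:

```python
# Hypothetical tool definitions; the descriptions encode the cost tradeoff
# so the model reaches for the cheap tool first.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "search_snippets",
            "description": (
                "Cheap first pass: returns titles, dates, domains, and short "
                "snippets. Always try this before fetching full pages."
            ),
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "fetch_full_page",
            "description": (
                "Expensive second pass: fetches and cleans full page content. "
                "Use only when snippets cannot satisfy the output schema."
            ),
            "parameters": {
                "type": "object",
                "properties": {"url": {"type": "string"}},
                "required": ["url"],
            },
        },
    },
]
```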
Structured Output as a First-Class Contract
Consumer research tools can get away with returning a prose report. API-first research systems cannot. Every layer of the system—task workers and the final observer response—should emit validated JSON against a schema specified at call time.
The output schema should be provided by the caller, not hardcoded by the system. This makes the agent composable: a downstream pipeline that needs a citation list in a specific format simply passes that schema at invocation.
```python
import json

import openai


def run_task_worker(instructions: str, output_schema: dict, context: str) -> dict:
    client = openai.OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a research assistant. Always respond with "
                    "valid JSON matching the provided schema."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Task: {instructions}\n\nContext:\n{context}\n\n"
                    f"Output schema:\n{json.dumps(output_schema)}"
                ),
            },
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```
Structured output also makes observability dramatically easier: you can validate, diff, and aggregate results programmatically rather than parsing free text.
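For example, a minimal validation gate at the boundary between worker and observer. This stdlib sketch only checks the schema's `required` keys; a production system would use a full JSON Schema validator such as the jsonschema package:

```python
import json


def validated_output(raw: str, schema: dict) -> dict:
    """Parse a worker's raw JSON and fail fast if it violates the contract.

    Minimal check: only enforces the schema's top-level required keys.
    """
    data = json.loads(raw)
    missing = [k for k in schema.get("required", []) if k not in data]
    if missing:
        raise ValueError(f"output missing required fields: {missing}")
    return data
```

Failing at this boundary is cheap; letting a malformed worker output reach the observer means the error surfaces as a confusing synthesis failure instead.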
Observability Requirements
A deep research agent operating in production will have highly variable token consumption—a single-task query and a twelve-task query are not remotely comparable in cost. Without per-run token visibility broken down by component, pricing and scaling decisions become guesswork.
At minimum, instrument each node to emit: total prompt tokens, completion tokens, cache hit rate, and tool call count. Aggregate these at the run level so you can correlate query complexity (number of tasks spawned) with cost. This data directly informs rate limits, per-query pricing tiers, and timeout thresholds.
Do not wait until post-launch to add token tracking. The reasoning token budget for an observer synthesizing twelve task outputs can be an order of magnitude larger than a single-step query. Discovering this in production billing rather than pre-launch profiling is painful.
The observer’s synthesis step is typically the most expensive single node and the one most sensitive to context size—monitor it separately from task worker costs so you can tune context-passing strategies independently.
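A minimal per-run accounting sketch along these lines. The `NodeUsage` fields and the node naming convention are assumptions for illustration, not any particular tracing API; a real deployment would emit these events to a tracing backend such as LangSmith:

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class NodeUsage:
    node: str  # e.g. "planner", "task:<id>", "observer"
    prompt_tokens: int
    completion_tokens: int
    cached_tokens: int = 0
    tool_calls: int = 0


def summarize_run(events: List[NodeUsage]) -> Dict[str, dict]:
    """Aggregate usage per node so observer cost is visible on its own."""
    totals: Dict[str, dict] = defaultdict(
        lambda: {"prompt": 0, "completion": 0, "cached": 0, "tool_calls": 0}
    )
    for e in events:
        t = totals[e.node]
        t["prompt"] += e.prompt_tokens
        t["completion"] += e.completion_tokens
        t["cached"] += e.cached_tokens
        t["tool_calls"] += e.tool_calls
    return dict(totals)
```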
This article is an AI-generated summary. Read the original post: How Exa built a Web Research Multi-Agent System with LangGraph and LangSmith.