The Agent Harness Pattern: Opinionated Infrastructure for Complex Agents
How to design a batteries-included agent harness that bundles planning, file I/O, sub-agent delegation, and context management into a reusable, composable substrate.
Building a capable AI agent requires assembling the same set of primitives over and over: a way for the model to plan, a place to persist intermediate work, mechanisms to delegate subtasks, and guardrails for when context balloons out of control. The agent harness pattern codifies these recurring needs into a reusable infrastructure layer so you ship capabilities instead of boilerplate.
What an Agent Harness Actually Is
An agent harness sits between the raw LLM and your application logic. It is not a prompt template or a chain — it is closer to a runtime: an opinionated collection of tools, memory strategies, and execution policies that are wired together by default and overridable on demand.
The distinction matters. A plain tool-calling loop hands the model a list of functions and lets it improvise. A harness additionally provides:
- Default tool contracts — The model receives pre-written guidance on when and how to use each tool, not just its schema.
- Lifecycle policies — What happens when the context window fills up? When a subtask fails? When output exceeds a size threshold?
- Composition primitives — A mechanism to spawn child agents with isolated context, pass them scoped instructions, and collect results.
The trade-off is explicitness for productivity. You accept the harness’s opinions in exchange for a working agent on day one, then peel back individual layers as you hit their limits.
The Core Tool Triad: Planning, Persistence, and Execution
Most complex agentic tasks decompose into three categories of action that a harness should support natively.
Planning tools let the model maintain a structured task list it can read and mutate. Rather than holding the entire plan in the system prompt, the model writes todos, marks them complete, and reorders them as it learns more. This externalizes working memory and makes plan state inspectable by your observability stack.
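A minimal version of such a planning tool can be sketched as a small state object exposed to the model as tools. The names (`write_todos`, `complete`) and the checkbox rendering are illustrative, not any specific library's API:

```python
from dataclasses import dataclass, field

@dataclass
class Todo:
    text: str
    done: bool = False

@dataclass
class PlanState:
    todos: list[Todo] = field(default_factory=list)

    def write_todos(self, items: list[str]) -> str:
        """Tool: replace the current plan with a new ordered task list."""
        self.todos = [Todo(t) for t in items]
        return self.render()

    def complete(self, index: int) -> str:
        """Tool: mark a task finished so later turns see progress."""
        self.todos[index].done = True
        return self.render()

    def render(self) -> str:
        """Serialized view that both the model and your tracing UI can read."""
        return "\n".join(
            f"[{'x' if t.done else ' '}] {i}. {t.text}"
            for i, t in enumerate(self.todos)
        )
```

Because the plan lives outside the prompt, every mutation is a tool call your observability stack can log and replay.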
Filesystem tools — read, write, edit, list, search — give the model a durable scratchpad that survives individual LLM calls. Long outputs (search results, generated code, research summaries) get written to files instead of piling up in the context window. Subsequent steps reference the file path rather than the content, which keeps token counts manageable.
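The key design choice is that write operations return a short reference instead of echoing content back into the message stream. A sketch with an in-memory store (a real harness would back this with a workspace directory):

```python
class Scratchpad:
    """Durable store for intermediate work; survives individual LLM calls."""

    def __init__(self):
        self.files: dict[str, str] = {}

    def write_file(self, path: str, content: str) -> str:
        self.files[path] = content
        # Return a reference, not the content, to keep token counts low.
        return f"Wrote {len(content)} chars to {path}"

    def read_file(self, path: str) -> str:
        return self.files[path]

    def list_files(self) -> list[str]:
        return sorted(self.files)
```

A later step that needs the search results calls `read_file` with the path it was handed, rather than carrying the full text through every intervening turn.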
Execution tools (shell access, sandboxed command runners) let the model verify its own work. A coding agent that can run tests closes the feedback loop without human intervention. Sandboxing is non-negotiable here: the harness, not the model prompt, enforces what the execution environment can reach.
Do not rely on the model’s judgment to avoid dangerous shell commands. Sandbox at the infrastructure level — restrict filesystem paths, disable network egress, run in an ephemeral container — so that even a maximally compliant model cannot cause harm outside those boundaries.
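A sketch of the infrastructure-level enforcement using only the standard library: binary allow-listing, a fixed working root, a stripped environment, and a timeout. The allow-list and sandbox root here are illustrative; production harnesses typically layer containers and network isolation on top:

```python
import shlex
import subprocess
from pathlib import Path

ALLOWED_ROOT = Path("/tmp/agent-workspace")      # illustrative sandbox root
ALLOWED_BINARIES = {"ls", "cat", "python3", "pytest"}

def run_sandboxed(command: str, timeout: int = 30) -> str:
    """Run a shell command with allow-listed binaries inside a fixed root."""
    argv = shlex.split(command)
    if not argv or argv[0] not in ALLOWED_BINARIES:
        # The policy rejects the call before anything executes — the model's
        # compliance is irrelevant to what the environment permits.
        return f"blocked: '{argv[0] if argv else ''}' is not an allowed binary"
    proc = subprocess.run(
        argv,
        cwd=ALLOWED_ROOT,
        capture_output=True,
        text=True,
        timeout=timeout,
        env={"PATH": "/usr/bin:/bin"},           # no inherited secrets
    )
    return proc.stdout + proc.stderr
```

The blocking happens in harness code, not in prompt text, which is the property the warning above demands.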
Sub-Agent Delegation and Context Isolation
Long-horizon tasks often require parallel or sequential subtasks that each accumulate significant context. Running all of them in a single conversation thread bloats the context window and turns debugging into a crawl through a trace thousands of tokens long.
The harness pattern addresses this with a task delegation tool: the orchestrator invokes a sub-agent with a focused instruction and an isolated context window, receives a structured result, and continues. The sub-agent can itself use the full tool suite — including spawning further sub-agents — creating a recursive capability without shared state pollution.
```
┌─────────────────────────────────────┐
│ Orchestrator Agent                  │
│ context: plan + task list           │
│                                     │
│ write_todos → [task A, task B]      │
│ task("Research X") ─────────────┐   │
│ task("Draft report") ───────┐   │   │
└─────────────────────────────│───│───┘
                              │   │
          ┌───────────────────┘   │
          ▼                       ▼
  ┌─────────────────┐     ┌─────────────────┐
  │  Sub-Agent A    │     │  Sub-Agent B    │
  │  isolated ctx   │     │  isolated ctx   │
  │  read/write/run │     │  read/write/run │
  │  → returns str  │     │  → returns str  │
  └─────────────────┘     └─────────────────┘
           │                       │
           └───────────┬───────────┘
                       ▼
     Orchestrator resumes with summarized results
```
This architecture keeps each agent’s context window bounded and meaningful. The orchestrator sees summaries; the sub-agents see only what they need to complete their slice.
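The isolation boundary reduces to a function that builds a fresh message list per subtask, so no parent history leaks in. This sketch uses a stand-in `call_model` function (an assumption, not a specific provider's API) to keep the boundary visible:

```python
def call_model(messages: list[dict]) -> str:
    """Stand-in for a provider call; replace with your LLM client."""
    return f"(result for: {messages[-1]['content']})"

def task(instruction: str, sub_agent_prompt: str) -> str:
    """Spawn a sub-agent with an isolated context and return its result.

    The sub-agent sees only its scoped prompt and instruction, never the
    orchestrator's history; the orchestrator sees only the returned string.
    """
    messages = [
        {"role": "system", "content": sub_agent_prompt},
        {"role": "user", "content": instruction},
    ]
    # In a full harness this would be an agent loop with its own tool
    # suite — including task() itself, which is what makes it recursive.
    return call_model(messages)

# Orchestrator side: results come back as bounded summaries.
summaries = [task(t, "You are a focused research sub-agent.")
             for t in ["Research X", "Draft report"]]
```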
Context Management as a First-Class Concern
Context bloat is inevitable in long-running agents. The harness should have explicit policies for it, not leave the model to cope ad hoc.
Common strategies a harness can implement automatically:
- Auto-summarization: When the message list exceeds a token threshold, compress older turns into a structured summary and prepend it as a synthetic system message.
- File offloading: Tool outputs larger than N tokens are written to the filesystem and replaced in the message stream with a file reference and a one-line description.
- Rolling windows: Keep only the last K tool call pairs in active context; archive the rest to a retrievable store.
Auto-summarization works best when you preserve structured data (file paths created, todos completed, key facts discovered) rather than compressing prose. Models lose less information when summaries are lists of facts rather than paragraphs.
The harness enforces these policies transparently, so individual tools and prompts do not need to account for context pressure.
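The file-offloading policy above can be sketched as a single interception point between tool execution and the message stream. The threshold value and the crude `len`-based token estimate are assumptions; a real harness would use the model's tokenizer:

```python
MAX_TOOL_OUTPUT_TOKENS = 500  # illustrative threshold

def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English text."""
    return len(text) // 4

def offload_if_large(tool_output: str, path: str, scratch: dict) -> str:
    """Replace oversized tool outputs with a file reference in the stream."""
    if estimate_tokens(tool_output) <= MAX_TOOL_OUTPUT_TOKENS:
        return tool_output
    scratch[path] = tool_output
    first_line = tool_output.splitlines()[0][:80]
    return f"[output offloaded to {path}: {first_line}...]"
```

Because the interception runs on every tool result, no individual tool needs to know the policy exists, which is exactly the transparency claimed above.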
Designing for Extensibility Without Sacrificing Defaults
A harness becomes a liability if it is too rigid. The pattern works best when the default configuration is genuinely useful out of the box, but every layer is independently replaceable.
Practically, this means:
- Tools are additive: Users inject custom tools alongside the built-ins; the harness merges them and updates the model’s guidance accordingly.
- Prompts are overridable per-layer: The system prompt, the per-tool usage guidance, and the sub-agent instructions are separate strings, not a monolithic template.
- The model is a parameter: The harness should not assume a specific provider. Any model capable of tool calling should slot in.
- The graph is the output: If the harness compiles to a standard graph format (LangGraph, a state machine, etc.), it integrates with streaming, checkpointing, and observability tooling without special-casing.
The goal is strong defaults with a shallow cliff: a new engineer gets a working complex agent in minutes, while a senior engineer can replace the planning tool, filesystem backend, or context policy without forking the entire framework.
```python
# Minimal harness usage — defaults handle planning, files, sub-agents
from deepagents import create_deep_agent

agent = create_deep_agent()
result = agent.invoke(
    {"messages": [{"role": "user", "content": "Analyze the codebase and write a migration plan"}]}
)
```

```python
# Targeted customization — only override what you need
from deepagents import create_deep_agent
from langchain.chat_models import init_chat_model

agent = create_deep_agent(
    model=init_chat_model("anthropic:claude-opus-4-5"),
    tools=[my_db_query_tool, my_deployment_tool],
    system_prompt="You are a platform reliability engineer.",
)
```
The agent harness pattern is ultimately about separating the generic hard parts of complex agents — planning, persistence, delegation, context control — from the domain-specific logic that makes your agent useful. Getting that separation right is what allows a single infrastructure investment to power many different agent applications.
This article is an AI-generated summary. Read the original source: GitHub - langchain-ai/deepagents.