Context as a Deployable Artifact: The Third Layer of the Agent Stack

Agent context files are being pulled out of repos and into versioned, governed runtime stores — creating a third deployment surface alongside harness code and model weights.

For most of the past year, the agent stack had two moving parts: the harness (code, tools, orchestration) and the model (weights, provider, version). Context — the AGENTS.md files, skills, policies, few-shot examples — lived inside the harness repo, deployed when the harness deployed. That arrangement is breaking down. Context is becoming its own deployable artifact, with its own storage, versioning, governance, and release cadence.

The clearest signal, in mid-May 2026, is LangChain’s Context Hub: a dedicated store for context files with environment tagging and version pinning, explicitly separated from harness code. Anthropic’s guidance on long-running coding agents reinforces the same split — their planner-generator-evaluator architecture treats structured handoff documents (which are context) as a first-class concern distinct from the agent loop itself. Nous Research’s Hermes Agent goes further: the agent writes and refines its own skills from feedback, which means context files mutate at runtime, independently of any harness deploy. And LangSmith’s LLM Gateway, sitting between agents and providers to enforce spend caps and PII redaction, makes sense only if you accept that the context flowing through the agent is a governance surface in its own right.

This is the third layer materializing.

Why the split is happening now

Context used to be small enough that nobody worried about lifecycle. A system prompt, a handful of examples, maybe a tool description. You committed it, you deployed it, you moved on. That model breaks once you have:

Multiple agents sharing a common policy file that needs to update without redeploying every consumer
Skills that the agent itself authors or revises
Per-environment context (different examples for staging vs. production, different redaction rules per tenant)
Compliance review on prompts independent of code review
A/B testing of instruction variants against the same harness and model

Each of these pushes context toward the same lifecycle that configuration and feature flags went through a decade ago. You don’t redeploy your service to flip a flag. You shouldn’t redeploy your harness to swap an instruction set.

Cursor’s harness blog makes the related point from the opposite direction: as models improve, the harness shifts from static context-stuffing toward dynamic context fetching. That’s not just a quality optimization — it’s an architectural commitment that context lives outside the harness and is retrieved at runtime, like rows from a database rather than constants in a binary.

What this means for failure modes

Once context has its own lifecycle, it has its own failure modes — and several papers from the same period map them.

The constraint drift paper argues that safety-critical constraints get silently weakened as they pass through memory, delegation, and tool calls. That’s a context-layer failure: the constraint was correctly specified at the entry point, but the system has no mechanism to maintain it as execution state. Slipstream’s contribution is essentially the same observation applied to compaction — the summary that replaces full history is itself a context artifact whose fidelity needs validation, not assumption. OrchJail demonstrates the offensive version: attackers exploit the gap between context as specified and context as it actually shapes orchestration decisions.
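One way to make the compaction concern concrete is to treat constraints as tagged items that must survive every summary. The sketch below is illustrative only: the `[C-1]` tagging convention and the `compact_with_constraints` helper are assumptions made for this example, not anything described in the papers mentioned above. It checks a compacted summary for each constraint's tag and re-attaches any that were dropped.

```python
def compact_with_constraints(summary: str, constraints: dict[str, str]) -> str:
    """Re-attach any constraint whose ID tag is absent from a compacted summary.

    `constraints` maps a hypothetical ID such as "C-1" to its full wording.
    """
    # A constraint counts as "present" only if its bracketed tag appears verbatim.
    missing = [cid for cid in constraints if f"[{cid}]" not in summary]
    if not missing:
        return summary
    restored = "\n".join(f"[{cid}] {constraints[cid]}" for cid in missing)
    return f"{summary}\n\nConstraints restored after compaction:\n{restored}"
```

A tag check is deliberately crude: it catches a constraint that vanished, not one whose wording was softened, so it is a floor for validation rather than a substitute for it.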

Warning

When context moves to its own deployment layer, every guarantee you previously inherited from “it’s in the repo, it ships with the code” needs to be re-established. Version pinning, rollback semantics, drift detection between declared and effective context — these become operational concerns, not afterthoughts.
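Drift detection between declared and effective context can start as a fingerprint comparison. The following is a minimal sketch, assuming both sides can be represented as a mapping from file name to file text; the function names and status labels are hypothetical.

```python
import hashlib


def fingerprint(text: str) -> str:
    """SHA-256 digest of a context file, ignoring trailing whitespace."""
    normalized = "\n".join(line.rstrip() for line in text.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def detect_drift(declared: dict[str, str], effective: dict[str, str]) -> dict[str, str]:
    """Compare declared context files with what the agent actually loaded.

    Returns {file name: status} for every file that is missing, unexpected,
    or modified. An empty result means no drift was detected.
    """
    report = {}
    for name in sorted(set(declared) | set(effective)):
        if name not in effective:
            report[name] = "missing"
        elif name not in declared:
            report[name] = "unexpected"
        elif fingerprint(declared[name]) != fingerprint(effective[name]):
            report[name] = "modified"
    return report
```

Running a check like this at agent startup, and again after any runtime write to the context store, gives a cheap signal that the two views have diverged.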

The LangSmith Engine release fits here too. It clusters production failures, diagnoses them against code, and drafts PRs. But some agent failures are context bugs rather than code bugs: a stale example, a missing policy clause, a skill that drifted. An improvement loop that can only touch code will misdiagnose those. That Context Hub and Engine shipped in the same week looks deliberate; you need addressable, versioned context before automated diagnosis can target it.

The three-layer model

It’s worth naming the layers explicitly because they have different cadences, owners, and governance needs.

Layer	Deploy cadence	Typical owner	Failure signature
Model weights	Weeks to months	Provider or ML team	Capability regression, behavioral shift
Harness code	Days	Platform / agent engineers	Crashes, tool errors, orchestration bugs
Context	Hours to minutes	Product, domain experts, the agent itself	Policy drift, stale examples, instruction conflicts

The Artificial Analysis coding agent index is interesting precisely because it benchmarks model-harness pairs rather than models alone, acknowledging that the harness is a co-determinant of performance. The next version of that benchmark will likely need context as a third axis: the same harness and model can produce materially different results when given different context bundles.

Anthropic’s separate Agent SDK monthly credit is a commercial expression of the same shift. Programmatic SDK usage is metered apart from interactive usage because the workloads are structurally different. The same logic will likely extend inside the platform: context operations (publish, fetch, version, audit) are candidates for their own metering, quotas, and SLOs.

What to do about it

If you’re building production agents, three shifts are worth making now.

First, pull context out of your harness repo. Even a thin abstraction — a context loader that reads from a versioned store keyed by environment — buys you the ability to roll back instructions without a code deploy. You don’t need a hosted product for this; a Git repo of context files with tags and a fetch step is a credible starting point.
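A thin loader of this kind might look like the sketch below. It is an assumption-laden illustration, not a reference to any hosted product: the `ContextStore` class, the `versions/<version>/` directory layout, and the `pins.json` file are all invented for this example, and it handles only flat `.md` file names.

```python
import json
from pathlib import Path


class ContextStore:
    """Hypothetical file-backed store.

    Layout: <root>/versions/<version>/*.md holds immutable context bundles;
    <root>/pins.json maps an environment name to its pinned version.
    """

    def __init__(self, root: Path):
        self.root = Path(root)

    def _pins(self) -> dict:
        pins_file = self.root / "pins.json"
        return json.loads(pins_file.read_text()) if pins_file.exists() else {}

    def publish(self, version: str, files: dict[str, str]) -> None:
        target = self.root / "versions" / version
        # exist_ok=False: republishing an existing version raises FileExistsError.
        target.mkdir(parents=True, exist_ok=False)
        for name, body in files.items():
            (target / name).write_text(body)

    def pin(self, environment: str, version: str) -> None:
        if not (self.root / "versions" / version).is_dir():
            raise ValueError(f"unknown version: {version}")
        pins = self._pins()
        pins[environment] = version
        (self.root / "pins.json").write_text(json.dumps(pins, indent=2))

    def load(self, environment: str) -> tuple[str, dict[str, str]]:
        """Return (pinned version, {file name: text}) for an environment."""
        version = self._pins()[environment]
        folder = self.root / "versions" / version
        files = {p.name: p.read_text() for p in sorted(folder.glob("*.md"))}
        return version, files
```

Rolling back is then a call to `pin` with the previous version; the harness picks up the change on its next `load` without a code deploy.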

Second, instrument the context layer separately. Your traces should record which context version was active, not just which harness build. When LangSmith Engine or your equivalent clusters a failure, you want to know whether the regression coincides with a code change, a model swap, or a context update. Without a context version in the trace, that question cannot be answered.
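As a rough illustration, a trace record can carry one version identifier per layer so that a regression can be attributed to the layer that changed. The record shapes and function names below are hypothetical and not tied to any tracing product.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class RunVersions:
    """One identifier per layer of the stack, captured at the start of a run."""
    harness_build: str
    model: str
    context_version: str


@dataclass
class TraceRecord:
    run_id: str
    versions: RunVersions
    ok: bool


def changed_layers(baseline: RunVersions, failing: RunVersions) -> list[str]:
    """Name the layers that differ between a known-good run and a failing run."""
    layers = []
    if baseline.harness_build != failing.harness_build:
        layers.append("harness")
    if baseline.model != failing.model:
        layers.append("model")
    if baseline.context_version != failing.context_version:
        layers.append("context")
    return layers


def failure_rate_by_context(records: list[TraceRecord]) -> dict[str, float]:
    """Fraction of failed runs for each context version seen in the traces."""
    totals, failures = Counter(), Counter()
    for record in records:
        version = record.versions.context_version
        totals[version] += 1
        if not record.ok:
            failures[version] += 1
    return {version: failures[version] / totals[version] for version in totals}
```

If `changed_layers` returns only `["context"]` for a failing cluster, the investigation can begin with the context diff rather than the code diff.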

Third, treat self-modifying context as a deployment event. If you’re following the Hermes pattern — agents authoring and refining their own skills — those skill writes need the same review, rollback, and audit treatment as a code merge. Otherwise you’ll discover, weeks later, that the agent has quietly rewritten its own constraints. The constraint drift paper is essentially a warning that this is already happening in systems that weren’t designed to notice.
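One possible shape for that gate is sketched below, under the assumption that agent-authored skills are plain text keyed by name. The `SkillRegistry` class and its methods are hypothetical and do not describe how Hermes Agent works. Agent writes land as pending proposals, only an approval makes them live, and every approval records what it replaced so it can be rolled back.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SkillRegistry:
    """Hypothetical registry where agent writes are proposals, not live skills."""
    live: dict[str, str] = field(default_factory=dict)
    pending: dict[str, str] = field(default_factory=dict)
    # Audit trail of approvals: (skill name, approver, previous live body or None).
    history: list[tuple[str, str, Optional[str]]] = field(default_factory=list)

    def propose(self, name: str, body: str) -> None:
        """Called by the agent; never changes what is live."""
        self.pending[name] = body

    def approve(self, name: str, approver: str) -> None:
        """Promote a pending proposal and record what it replaced."""
        body = self.pending.pop(name)
        self.history.append((name, approver, self.live.get(name)))
        self.live[name] = body

    def rollback(self, name: str) -> None:
        """Undo the most recent approved change to a skill."""
        for i in range(len(self.history) - 1, -1, -1):
            skill, _approver, previous = self.history[i]
            if skill == name:
                del self.history[i]
                if previous is None:
                    self.live.pop(name, None)
                else:
                    self.live[name] = previous
                return
        raise KeyError(f"no approved change to roll back for {name!r}")
```

The approver could be a human reviewer or an automated policy check; the point of the sketch is only that the write path and the activation path are separate.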

Tip

A practical test: can you answer “what instructions was this agent running at 14:32 UTC last Thursday?” in under a minute? If not, your context layer isn’t versioned in a way that survives an incident review.
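Answering that question quickly mostly requires an append-only log of when each context version became active. A minimal sketch, assuming activations are recorded in chronological order with timezone-aware timestamps (the `DeploymentLog` class and the version names are hypothetical):

```python
from bisect import bisect_right
from datetime import datetime, timezone
from typing import Optional


class DeploymentLog:
    """Append-only log of (activation time, context version) for one environment."""

    def __init__(self) -> None:
        self._times: list[datetime] = []
        self._versions: list[str] = []

    def record(self, activated_at: datetime, version: str) -> None:
        if self._times and activated_at < self._times[-1]:
            raise ValueError("activations must be recorded in chronological order")
        self._times.append(activated_at)
        self._versions.append(version)

    def version_at(self, moment: datetime) -> Optional[str]:
        """Return the version active at `moment`, or None if nothing was deployed yet."""
        # bisect_right finds the first activation strictly after `moment`;
        # the entry just before it is the one that was in effect.
        index = bisect_right(self._times, moment) - 1
        return self._versions[index] if index >= 0 else None


log = DeploymentLog()
log.record(datetime(2026, 5, 11, 9, 0, tzinfo=timezone.utc), "ctx-7")
log.record(datetime(2026, 5, 14, 14, 0, tzinfo=timezone.utc), "ctx-8")
active = log.version_at(datetime(2026, 5, 14, 14, 32, tzinfo=timezone.utc))
```

Combined with immutable version contents, the returned identifier is enough to reconstruct the exact instructions for an incident review.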

The agent stack has been stratifying for a while; mid-May 2026 made the third layer hard to ignore. Naming, instrumenting, and governing that layer explicitly means less time guessing why an agent’s behavior changed, and more confidence that each change was shipped on purpose.