danielhuber.dev@proton.me Sunday, May 24, 2026

Context as a Deployable Artifact: The Third Layer of the Agent Stack

Agent context files are being pulled out of repos and into versioned, governed runtime stores — creating a third deployment surface alongside harness code and model weights.


May 16, 2026

For most of the past year, the agent stack had two moving parts: the harness (code, tools, orchestration) and the model (weights, provider, version). Context — the AGENTS.md files, skills, policies, few-shot examples — lived inside the harness repo, deployed when the harness deployed. That arrangement is breaking down. Context is becoming its own deployable artifact, with its own storage, versioning, governance, and release cadence.

The clearest signal this week is LangChain’s Context Hub: a dedicated store for context files with environment tagging and version pinning, explicitly separated from harness code. Anthropic’s guidance on long-running coding agents reinforces the same split — their planner-generator-evaluator architecture treats structured handoff documents (which are context) as a first-class concern distinct from the agent loop itself. Nous Research’s Hermes Agent goes further: the agent writes and refines its own skills from feedback, which means context files mutate at runtime, independently of any harness deploy. And LangSmith’s LLM Gateway, sitting between agents and providers to enforce spend caps and PII redaction, makes sense only if you accept that the context flowing through the agent is a governance surface in its own right.

This is the third layer materializing.

Why the split is happening now

Context used to be small enough that nobody worried about lifecycle. A system prompt, a handful of examples, maybe a tool description. You committed it, you deployed it, you moved on. That model breaks once you have:

  • Multiple agents sharing a common policy file that needs to update without redeploying every consumer
  • Skills that the agent itself authors or revises
  • Per-environment context (different examples for staging vs. production, different redaction rules per tenant)
  • Compliance review on prompts independent of code review
  • A/B testing of instruction variants against the same harness and model

Each of these pushes context toward the same lifecycle that configuration and feature flags went through a decade ago. You don’t redeploy your service to flip a flag. You shouldn’t redeploy your harness to swap an instruction set.

Cursor’s harness blog makes a related point from the opposite direction: as models improve, the harness shifts from static context-stuffing toward dynamic context fetching. That is more than a quality optimization. It is an architectural commitment that context lives outside the harness and is retrieved at runtime, like rows from a database rather than constants in a binary.

What this means for failure modes

Once context has its own lifecycle, it has its own failure modes — and several of this week’s papers are mapping them.

The constraint drift paper argues that safety-critical constraints get silently weakened as they pass through memory, delegation, and tool calls. That’s a context-layer failure: the constraint was correctly specified at the entry point, but the system has no mechanism to maintain it as execution state. Slipstream’s contribution is essentially the same observation applied to compaction — the summary that replaces full history is itself a context artifact whose fidelity needs validation, not assumption. OrchJail demonstrates the offensive version: attackers exploit the gap between context as specified and context as it actually shapes orchestration decisions.

The LangSmith Engine release fits here too. It clusters production failures, diagnoses against code, and drafts PRs. But a meaningful fraction of agent failures are context bugs, not code bugs — a stale example, a missing policy clause, a skill that drifted. An improvement loop that can only touch code will misdiagnose those. The fact that Context Hub and Engine shipped the same week is not a coincidence; you need addressable, versioned context before automated diagnosis can target it.

The three-layer model

It’s worth naming the layers explicitly because they have different cadences, owners, and governance needs.

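| Layer         | Deploy cadence   | Typical owner                            | Failure signature                                    |
|---------------|------------------|------------------------------------------|------------------------------------------------------|
| Model weights | Weeks to months  | Provider or ML team                      | Capability regression, behavioral shift              |
| Harness code  | Days             | Platform / agent engineers               | Crashes, tool errors, orchestration bugs             |
| Context       | Hours to minutes | Product, domain experts, the agent itself | Policy drift, stale examples, instruction conflicts  |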

The Artificial Analysis coding agent index is interesting precisely because it benchmarks model-harness pairs rather than models alone, acknowledging that the harness is a co-determinant of performance. The next version of that benchmark almost certainly needs to add context as a third axis: the same harness and model, given different context bundles, will produce materially different results.

Anthropic’s separate Agent SDK monthly credit is a commercial expression of the same shift. Programmatic SDK usage is metered apart from interactive usage because the workloads are structurally different. The same logic will apply inside the platform: context operations (publish, fetch, version, audit) will get their own metering, their own quotas, their own SLOs.

What to do about it

If you’re building production agents, three shifts are worth making now.

First, pull context out of your harness repo. Even a thin abstraction — a context loader that reads from a versioned store keyed by environment — buys you the ability to roll back instructions without a code deploy. You don’t need a hosted product for this; a Git repo of context files with tags and a fetch step is a credible starting point.
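As a minimal sketch of that thin abstraction, the loader below reads context files from a versioned store keyed by environment. The directory layout (a `pins.json` file mapping environment to version, plus one folder per version under `versions/`) and the names `load_context` and `pin` are illustrative assumptions, not the API of Context Hub or any other product.

```python
import json
from pathlib import Path


def load_context(store: Path, environment: str) -> dict:
    """Return the context bundle pinned for `environment`.

    Hypothetical layout (an assumption, not a standard):
      store/pins.json            {"production": "v12", "staging": "v13"}
      store/versions/v12/*.md    the context files for that version
    """
    pins = json.loads((store / "pins.json").read_text(encoding="utf-8"))
    if environment not in pins:
        raise KeyError(f"no context version pinned for {environment!r}")
    version = pins[environment]
    version_dir = store / "versions" / version
    files = {
        path.name: path.read_text(encoding="utf-8")
        for path in sorted(version_dir.glob("*.md"))
    }
    # Returning the version alongside the files lets callers record it.
    return {"environment": environment, "version": version, "files": files}


def pin(store: Path, environment: str, version: str) -> None:
    """Point an environment at a version; rolling back is just re-pinning."""
    if not (store / "versions" / version).is_dir():
        raise FileNotFoundError(f"unknown context version {version!r}")
    pins_path = store / "pins.json"
    pins = json.loads(pins_path.read_text(encoding="utf-8"))
    pins[environment] = version
    pins_path.write_text(json.dumps(pins, indent=2), encoding="utf-8")
```

With this shape, rolling production back to earlier instructions is a single `pin(store, "production", "v12")` call, with no harness deploy involved.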

Second, instrument the context layer separately. Your traces should record which context version was active, not just which harness build. When LangSmith Engine or your equivalent clusters a failure, you want to know whether the regression coincides with a code change, a model swap, or a context update. Today most teams can’t answer that question.
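One way to make that question answerable, sketched under the assumption that every trace carries all three version identifiers, is shown below. The `RunVersions` record and `changed_layers` helper are hypothetical names for illustration and are not part of LangSmith or any tracing library.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RunVersions:
    """Version identifiers attached to every trace (hypothetical schema)."""

    harness_build: str
    model: str
    context_version: str


def changed_layers(last_good: RunVersions, first_bad: RunVersions) -> list[str]:
    """Name the layers whose version differs between two runs."""
    return [
        f.name
        for f in fields(RunVersions)
        if getattr(last_good, f.name) != getattr(first_bad, f.name)
    ]


# Usage: compare the last healthy run with the first failing one.
before = RunVersions(harness_build="build-41", model="model-a", context_version="v12")
after = RunVersions(harness_build="build-41", model="model-a", context_version="v13")
suspects = changed_layers(before, after)  # only the context layer moved
```

If the returned list contains only `context_version`, the regression coincides with a context update and a code-level diagnosis is likely looking in the wrong place.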

Third, treat self-modifying context as a deployment event. If you’re following the Hermes pattern — agents authoring and refining their own skills — those skill writes need the same review, rollback, and audit treatment as a code merge. Otherwise you’ll discover, weeks later, that the agent has quietly rewritten its own constraints. The constraint drift paper is essentially a warning that this is already happening in systems that weren’t designed to notice.
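A rough sketch of what that treatment could look like: agent-authored skill edits land as proposals, only become active after a separate reviewer approves them, and every step is written to an audit log. The `SkillStore` class is a hypothetical in-memory illustration, not how Hermes Agent or any existing product works.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class SkillStore:
    """Hypothetical in-memory store: skill edits are proposals until reviewed."""

    revisions: list = field(default_factory=list)  # approved texts, append-only
    pending: dict = field(default_factory=dict)    # proposal id -> proposal
    audit: list = field(default_factory=list)      # one entry per event
    _next_id: int = 1

    def _log(self, event: str, **details) -> None:
        stamp = datetime.now(timezone.utc).isoformat()
        self.audit.append({"at": stamp, "event": event, **details})

    @property
    def active(self) -> Optional[str]:
        """The skill text the agent actually runs: the latest approved revision."""
        return self.revisions[-1] if self.revisions else None

    def propose(self, author: str, text: str) -> int:
        """Record a change without activating it."""
        proposal_id = self._next_id
        self._next_id += 1
        self.pending[proposal_id] = {"author": author, "text": text}
        self._log("proposed", proposal=proposal_id, author=author)
        return proposal_id

    def approve(self, proposal_id: int, reviewer: str) -> int:
        """Activate a proposal; the author may not review their own change."""
        proposal = self.pending[proposal_id]
        if reviewer == proposal["author"]:
            raise PermissionError("author cannot approve their own skill change")
        del self.pending[proposal_id]
        self.revisions.append(proposal["text"])
        self._log("approved", proposal=proposal_id, reviewer=reviewer)
        return len(self.revisions)  # 1-based revision number

    def rollback(self, revision: int, reviewer: str) -> int:
        """Re-activate an earlier revision by appending it as a new one."""
        if not 1 <= revision <= len(self.revisions):
            raise IndexError(f"no revision {revision}")
        self.revisions.append(self.revisions[revision - 1])
        self._log("rolled_back", to_revision=revision, reviewer=reviewer)
        return len(self.revisions)
```

Because rollback appends rather than deletes, the revision list and audit log together preserve the full history of what the agent was allowed to run and who allowed it.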

Tip

A practical test: can you answer “what instructions was this agent running at 14:32 UTC last Thursday?” in under a minute? If not, your context layer isn’t versioned in a way that survives an incident review.
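If context pins are recorded as a time-ordered log, that question reduces to a lookup. The sketch below assumes a hypothetical `pin_log` of `(effective_from, version)` pairs sorted by time; the function name and log shape are illustrative only.

```python
from bisect import bisect_right
from datetime import datetime, timezone


def version_at(pin_log, when):
    """Return the context version in effect at `when`, or None if none was.

    `pin_log` is a list of (effective_from, version) tuples sorted by
    effective_from (a hypothetical format, assumed for illustration).
    """
    times = [effective_from for effective_from, _ in pin_log]
    index = bisect_right(times, when)  # count of pins made at or before `when`
    if index == 0:
        return None
    return pin_log[index - 1][1]


# Usage with made-up pin times.
utc = timezone.utc
pin_log = [
    (datetime(2026, 5, 1, 9, 0, tzinfo=utc), "v1"),
    (datetime(2026, 5, 14, 14, 0, tzinfo=utc), "v2"),
    (datetime(2026, 5, 14, 15, 0, tzinfo=utc), "v3"),
]
active = version_at(pin_log, datetime(2026, 5, 14, 14, 32, tzinfo=utc))
```

In this made-up log, the lookup for 14:32 UTC lands on the pin made at 14:00, so `active` is `"v2"`.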

The agent stack has been quietly stratifying for a while. This week made the third layer hard to ignore. The teams that name it, instrument it, and govern it explicitly will spend less time guessing why their agents changed behavior — and more time shipping the changes on purpose.

Tags: perspectives, context-engineering, harness, infrastructure