The Emerging Coding Agent Runtime Stack
Sandboxes, subagents, and deploy CLIs are converging into a recognizable runtime stack for coding agents. A look at how the layers are forming.
Within a single week, every major player in the coding agent space shipped the same three things: sandboxed execution environments, subagent orchestration, and streamlined deployment pipelines. LangChain launched LangSmith Sandboxes alongside Open SWE and a langgraph deploy CLI. OpenAI moved subagent support in Codex to general availability. Anthropic expanded context windows and shipped automatic caching on its platform API. A new open-source control plane appeared for Claude Code and Codex with sandbox policies and session provenance.
These parallel launches point to a coding agent runtime stack with clearly separable layers — and for most production decisions, understanding those layers now matters more than model selection.
The Three Layers That Just Locked In
If you squint at everything that shipped this week, you see the same architecture repeated across vendors and open-source projects:
Layer 1: Sandboxed Execution. LangSmith Sandboxes give agents isolated environments “in a single line of code.” The LACP control plane provides sandbox policies for Claude Code and Codex. Open SWE wraps everything in cloud sandboxes. The pattern is identical: the agent gets a disposable, networked container where it can run code, install dependencies, and observe results without touching the host. This is the equivalent of the container runtime layer in cloud infrastructure — it’s the thing that makes everything above it safe to iterate on.
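The Layer-1 contract can be sketched in a few lines. This is an illustrative skeleton, not any vendor's API: it stands in a throwaway temp directory where real implementations use containers or microVMs, but the interface — create, run, observe, destroy — is the pattern the products above share.

```python
import shutil
import subprocess
import tempfile

class DisposableSandbox:
    """Sketch of the Layer-1 contract: a throwaway workspace the agent
    executes in, then discards. Real implementations back this with
    containers or microVMs; a temp directory only shows the interface."""

    def __init__(self):
        self.workdir = tempfile.mkdtemp(prefix="agent-sbx-")

    def run(self, command: list[str], timeout: int = 30) -> tuple[int, str]:
        # Execute inside the workspace; the agent observes exit code + output.
        proc = subprocess.run(
            command, cwd=self.workdir, capture_output=True,
            text=True, timeout=timeout,
        )
        return proc.returncode, proc.stdout + proc.stderr

    def destroy(self) -> None:
        # Disposal is the point: nothing the agent did survives.
        shutil.rmtree(self.workdir, ignore_errors=True)

sbx = DisposableSandbox()
code, out = sbx.run(["python3", "-c", "print(2 + 2)"])
sbx.destroy()
```

The host never has to trust what ran inside: the agent iterates freely, and cleanup is a single call.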
Layer 2: Subagent Decomposition. OpenAI Codex now ships default subagents (“explorer,” “worker,” “default”). Simon Willison’s guide on subagents explicitly frames them as the solution to context window limits — break complex tasks into smaller agents that each fit within available context. Open SWE calls these “Deep Agents” and makes subagent orchestration a first-class architectural component. The idea is not new, but the standardization is: every serious coding agent now assumes a parent agent that plans and delegates to child agents that execute.
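The parent/child split can be made concrete with a minimal sketch. The role names mirror the ones Codex ships, but everything here is illustrative — the stubbed `run` stands in for a model call, and the token budget exists only to show why each subtask must fit a smaller context.

```python
from dataclasses import dataclass, field

@dataclass
class Subagent:
    """A child agent with its own bounded context (illustrative)."""
    role: str                       # e.g. "explorer" or "worker"
    max_context: int                # token budget the subtask must fit in
    history: list[str] = field(default_factory=list)

    def run(self, task: str) -> str:
        # Stand-in for a model call; a real harness invokes an LLM here.
        self.history.append(task)
        return f"[{self.role}] done: {task}"

def plan_and_delegate(goal: str, steps: list[str]) -> list[str]:
    """Parent agent: plans, then spawns a fresh subagent per step so no
    single context window has to hold the whole task."""
    results = []
    for step in steps:
        worker = Subagent(role="worker", max_context=8_000)
        results.append(worker.run(f"{goal}: {step}"))
    return results

results = plan_and_delegate(
    "fix flaky test",
    ["locate the test", "reproduce the failure", "patch and verify"],
)
```

Each worker starts with empty history, which is exactly the point: context accumulates per subtask, not per session.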
Layer 3: Deployment and Control Plane. LangChain’s langgraph deploy CLI, LACP’s one-command setup with quality gates and session provenance, and Anthropic’s automatic caching all address the same problem: making the harness operationally manageable. You need to deploy it, observe it, cache its repeated context, and enforce policies on what it’s allowed to do.
This three-layer split — sandbox, subagent orchestration, control plane — maps almost exactly to the container runtime / orchestrator / control plane split in Kubernetes. That’s not a metaphor. It’s the same set of engineering pressures: isolation, coordination, and policy enforcement over autonomous units of work.
The Harness Eclipses the Model
Consider what OpenAI shipped this week alongside Codex subagents: GPT-5.4 mini and nano, models explicitly optimized for “sub-agent tasks” and “high-volume API workloads.” That framing is revealing. The frontier model is no longer the product — it’s a component designed to be called thousands of times inside someone else’s harness. Meanwhile, Mistral released Small 4, a single model that unifies reasoning, multimodal, and agentic coding capabilities. NVIDIA released a 120B MoE model with 12B active parameters and 1M context, purpose-built for agentic workflows.
All of these models are competing to be the best substrate for a coding agent runtime. None of them are competing to be the runtime itself. That distinction matters enormously for practitioners. When LangChain’s Open SWE blog post says it provides “the same patterns used by elite engineering companies like Stripe, Ramp, and Coinbase for their internal coding agents,” the claim isn’t about which model those companies chose. It’s about the harness architecture — the sandbox configuration, the subagent topology, the deployment pipeline.
This is exactly what happened with web application servers. By 2010, nobody chose their web framework based on which HTTP parser it used. They chose it based on the middleware ecosystem, the deployment story, and the operational tooling. Coding agents just crossed that threshold.
What This Means for Choosing (or Building) a Stack
If you’re building internal coding agents today, the week’s news narrows your real decision space. The runtime architecture is no longer a design problem — it’s a selection problem. The question isn’t “should we use subagents?” (yes) or “should we sandbox execution?” (obviously). The question is whether you adopt an integrated stack like Open SWE / LangGraph, compose your own from lighter primitives like Stirrup or LACP, or build a bespoke harness around raw model APIs.
The integrated stacks now have a significant lead in one critical area: they’ve shipped the deployment and observability layer alongside the runtime. Open SWE gives you Slack integration, PR creation, and subagent orchestration in one package. LangSmith Sandboxes tie execution isolation directly to the tracing and evaluation platform. If you’re composing your own stack, you need to build that glue yourself — and in practice, that glue is where most production time gets spent.
The risk of adopting an integrated stack is the same as it was with early web frameworks: tight coupling makes it hard to swap components later. If your sandbox layer, orchestration layer, and observability layer all come from the same vendor, migrating any one of them means migrating all three. Build explicit interfaces between layers now, even if you’re using a single vendor’s implementation for all of them.
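One way to build that seam is to code against an explicit interface rather than a vendor SDK. A minimal sketch using Python's `typing.Protocol` — the `LocalSandbox` stand-in and the method names are hypothetical, not any vendor's actual surface:

```python
from typing import Protocol

class SandboxProvider(Protocol):
    """Explicit seam between your harness and any vendor's Layer 1.
    Orchestration code depends on this, not on a specific SDK, so the
    sandbox layer can be swapped without touching the layers above it."""
    def create(self) -> str: ...                      # returns a sandbox id
    def exec(self, sandbox_id: str, cmd: str) -> str: ...
    def destroy(self, sandbox_id: str) -> None: ...

class LocalSandbox:
    """Trivial in-process stand-in; a vendor adapter would satisfy
    the same Protocol with real isolation behind it."""
    def __init__(self):
        self._boxes: dict[str, list[str]] = {}
    def create(self) -> str:
        sid = f"sbx-{len(self._boxes)}"
        self._boxes[sid] = []
        return sid
    def exec(self, sandbox_id: str, cmd: str) -> str:
        self._boxes[sandbox_id].append(cmd)
        return f"ran {cmd}"
    def destroy(self, sandbox_id: str) -> None:
        del self._boxes[sandbox_id]

def run_task(provider: SandboxProvider, cmd: str) -> str:
    sid = provider.create()
    try:
        return provider.exec(sid, cmd)
    finally:
        provider.destroy(sid)

output = run_task(LocalSandbox(), "pytest -q")
```

Migration then means writing one adapter class, not rewriting the orchestration layer.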
Where This Goes in Six Months
The standardization of the coding agent runtime has two predictable second-order effects.
First, the competition shifts to control plane intelligence. When every coding agent has sandboxes and subagents, the differentiator becomes how the control plane manages them: how it routes tasks to subagents, how it decides when to spawn new sandboxes versus reuse existing ones, how it enforces quality gates and handles failures. LACP’s emphasis on “quality gates” and “session provenance” is early evidence of this shift. Expect the control plane to absorb more and more of what we currently think of as agent logic — retry policies, context management, checkpoint and resume.
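The shift is easiest to see in code: policy living in the harness rather than in the agent. A hedged sketch — the policy names echo LACP's framing, but this skeleton is illustrative, not its implementation:

```python
def run_with_policy(execute, task: str, max_attempts: int = 3) -> str:
    """Control-plane sketch: the harness, not the agent, owns retry and
    quality-gate policy. `execute` is any callable that runs one
    subagent attempt; this is an illustrative skeleton only."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            result = execute(task, attempt)
            # Quality gate: reject obviously broken outputs before
            # they propagate upward to the planner.
            if not result.strip():
                raise ValueError("empty result failed quality gate")
            return result
        except Exception as exc:
            last_error = exc
    raise RuntimeError(f"task failed after {max_attempts} attempts: {last_error}")

# A subagent stub that fails twice, then succeeds.
calls = {"n": 0}
def flaky_subagent(task: str, attempt: int) -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("sandbox timed out")
    return f"patched: {task}"

result = run_with_policy(flaky_subagent, "fix import error")
```

Notice that the subagent stays dumb: retries, gating, and failure handling all moved up a layer, which is the absorption the paragraph above predicts.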
Second, model providers will increasingly optimize for the subagent use case. GPT-5.4 nano isn’t a general-purpose model that happens to work as a subagent. It’s a model designed for high-volume, low-latency calls within a harness. Anthropic’s automatic caching targets the same workload — repeated system prompts across thousands of subagent invocations. The models are being shaped by the runtime, not the other way around. Practitioners should benchmark models specifically on subagent-scale tasks: small context windows, tool-heavy execution, high call volume. The model that wins a single-turn coding benchmark may perform poorly as a worker subagent called 200 times inside a planning loop.
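Benchmarking at subagent scale mostly means measuring the model the way the harness calls it: many small requests, with tail latency and failure rate mattering more than single-turn quality. A sketch under that assumption — `call_model` is a stand-in for your provider SDK:

```python
import statistics
import time

def benchmark_as_worker(call_model, tasks: list[str]) -> dict:
    """Measure a model under harness-shaped load: many small calls.
    `call_model` is a placeholder for a real provider call; the
    metrics collected are the point, not the stub."""
    latencies, failures = [], 0
    for task in tasks:
        start = time.perf_counter()
        try:
            call_model(task)
        except Exception:
            failures += 1
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    return {
        "calls": len(tasks),
        "failures": failures,
        "p50_ms": statistics.median(latencies) * 1000,
        "p95_ms": latencies[int(0.95 * len(latencies)) - 1] * 1000,
    }

# 200 stubbed worker calls, roughly the volume a planning loop generates.
stats = benchmark_as_worker(lambda t: f"ok:{t}",
                            [f"step-{i}" for i in range(200)])
```

Run the same loop against two candidate models and the p95 column will often reorder them relative to their single-turn benchmark scores.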
The coding agent is no longer a clever prompt wrapped in a script. It’s a runtime with layers, and the layers just became obvious enough to name.