The Harness Stops Being Generic: Model-Specific Profiles and Runtime-Authored Workflows

Two converging shifts — per-model harness profiles and agents that write their own orchestration at runtime — are breaking the assumption that an agent harness is a stable, model-agnostic abstraction.

For most of the last two years, the agent harness has been treated as a stable abstraction: a loop, some middleware, a tool registry, and a model slot you could swap. The bet was that harnesses would generalize and models would commoditize underneath them. This week’s releases suggest the opposite is happening. Harnesses are specializing per model, and in some cases the agent is writing its own harness at runtime. The ‘portable agent’ assumption is quietly eroding.

Two signals that point the same direction

LangChain reported a 10–20 point improvement on a tau2-bench subset by shipping model-specific harness profiles for Deep Agents — distinct prompts, tool shapes, and middleware tuned per model family. That’s not a tuning detail; that’s an admission that the harness no longer cleanly separates from the model. Harvey’s verifier work points the same way: batching judge calls and swapping in DeepSeek V4 Flash cut verifier cost by an order of magnitude, but only after the prompt and call structure were redesigned around that specific model’s behavior.
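To make "harness profile" concrete, here is a minimal Python sketch of what a per-family bundle could look like. Everything in it is an assumption for illustration: the `HarnessProfile` fields, the family names, the middleware labels, and the prefix-matching rule are hypothetical and are not taken from Deep Agents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HarnessProfile:
    """Hypothetical bundle of everything tuned for one model family."""
    family: str
    system_prompt: str
    tool_descriptions: dict[str, str]
    middleware: tuple[str, ...] = ()


# Illustrative entries only; real profiles would come out of per-model evals.
PROFILES = {
    "family-a": HarnessProfile(
        family="family-a",
        system_prompt="Plan briefly, then call tools.",
        tool_descriptions={"search": "Search the docs. Input: a short query."},
        middleware=("summarize_history",),
    ),
    "family-b": HarnessProfile(
        family="family-b",
        system_prompt="Think step by step before each tool call.",
        tool_descriptions={"search": "search(query: str) -> list[str]"},
        middleware=("summarize_history", "retry_on_malformed_call"),
    ),
}


def resolve_profile(model_name: str) -> HarnessProfile:
    """Pick the profile whose family name is the longest prefix of model_name."""
    matches = [name for name in PROFILES if model_name.startswith(name)]
    if not matches:
        # Fail loudly instead of silently falling back to a generic profile.
        raise KeyError(f"no harness profile for model {model_name!r}")
    return PROFILES[max(matches, key=len)]
```

The design choice worth noting is the hard failure on an unknown model: a silent generic fallback is exactly the case where the profile's gains disappear without anyone noticing.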

Meanwhile, Anthropic’s Claude Code dynamic workflows let Claude author its own orchestration at runtime — spawning tens to hundreds of parallel subagents, choosing topologies (parallel verification, adversarial passes) on the fly. The harness here is not a fixed graph the engineer designed; it’s an artifact the model emits per task. LangChain’s own create_agent post frames the same idea from the opposite direction: the harness is a minimal primitive whose value comes from how tightly it fits a specific task.
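One way to picture "the model emits the topology" is a harness that exposes only a spawning primitive and executes whatever plan arrives as data. The sketch below is a rough illustration under stated assumptions: the JSON plan format, the `run_subagent` stub, and the limits are all hypothetical, and this is not how Claude Code implements dynamic workflows.

```python
import asyncio
import json

MAX_SUBAGENTS = 200  # assumed hard cap enforced by the harness, not the model


async def run_subagent(task: str) -> str:
    """Stand-in for a real subagent call (hypothetical)."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"done: {task}"


async def execute_plan(plan_json: str, max_parallel: int = 8) -> list[str]:
    """Run a model-authored plan: the harness only validates and fans out."""
    plan = json.loads(plan_json)
    tasks = plan["tasks"]
    if len(tasks) > MAX_SUBAGENTS:
        raise ValueError(f"plan requests {len(tasks)} subagents, cap is {MAX_SUBAGENTS}")

    semaphore = asyncio.Semaphore(max_parallel)

    async def bounded(task: str) -> str:
        # Limit how many subagents run at once, regardless of plan size.
        async with semaphore:
            return await run_subagent(task)

    # gather preserves input order, so results line up with plan["tasks"].
    return await asyncio.gather(*(bounded(t) for t in tasks))


if __name__ == "__main__":
    model_emitted_plan = '{"tasks": ["review module A", "review module B"]}'
    print(asyncio.run(execute_plan(model_emitted_plan)))
```

The engineer still owns the caps and the concurrency limit; the model owns the list of tasks and, in richer plan formats, how they relate to each other.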

Harmonic’s Scout rebuild closes the loop. They tore out a hand-designed multi-subgraph LangGraph pipeline and replaced it with a single frontier model in a Deep Agents harness plus two tool sets. Simpler harness, better results, faster iteration. The structure that used to live in the graph now lives in the model — but only when paired with a harness profile that lets the model use it.

What’s actually converging

The interesting pattern is not ‘harnesses are getting more complex’ or ‘harnesses are getting simpler.’ Both are happening simultaneously, and they’re happening because the harness/model boundary is moving.

Note

The harness used to be the place where you encoded task structure because the model couldn’t be trusted to. As models get better at planning and self-orchestration, structural intelligence migrates into the model — but the interface between model and environment becomes more model-specific, not less.

Concretely, three things are co-moving:

Tool surfaces are becoming model-shaped. Nous’s Hermes Tool Search collapses large MCP tool arrays into three bridge tools once schemas exceed 10% of context. NVIDIA’s Nemotron 3 Ultra ships with configurable reasoning modes specifically tuned for agentic tool use. Anthropic added mid-conversation system messages — a primitive only useful if your harness is designed to inject per-turn instructions without breaking the cache.
Orchestration is moving inside the model context. Claude Code’s dynamic workflows and the MACU paper’s manager-dispatches-DAG pattern both push the planning graph into a single model’s reasoning rather than a framework’s control flow. The harness provides the spawning primitive; the model provides the topology.
Verification and governance are moving outside. The OCL paper’s pre-execution policy layer, LangSmith’s LLM Gateway, and Harvey’s batched verifiers all sit outside the agent loop. Decision generation is moving inward; decision checking is moving outward.
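The first of these shifts, collapsing a large tool array once its schemas pass a share of the context window, can be sketched in a few lines. This is a simplified illustration and not Nous's implementation: the three bridge tool names are hypothetical, and the four-characters-per-token estimate is a crude assumption standing in for a real tokenizer.

```python
import json

# Hypothetical bridge tools; the names are invented for this sketch.
BRIDGE_TOOLS = [
    {"name": "search_tools", "description": "Find tools by keyword."},
    {"name": "describe_tool", "description": "Return the full schema for one tool."},
    {"name": "call_tool", "description": "Invoke a tool by name with JSON arguments."},
]


def estimate_tokens(obj) -> int:
    """Very rough token estimate: about four characters per token (assumption)."""
    return len(json.dumps(obj)) // 4


def tool_surface(tools: list[dict], context_window: int, threshold: float = 0.10) -> list[dict]:
    """Expose the full tool list, or the bridge tools if schemas are too large."""
    if estimate_tokens(tools) > threshold * context_window:
        return list(BRIDGE_TOOLS)
    return list(tools)
```

Because the threshold is a fraction of the context window, the same tool set can be exposed in full to one model and collapsed for another, which is one small way a tool surface ends up model-shaped.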

This is the same split that happened in databases when query planners absorbed optimization logic but pushed constraints and access control to a separate layer. The middle — the hand-tuned execution plan — got squeezed out.

What breaks if you ignore this

The practical consequence: ‘write once, swap models’ was always aspirational, but it’s now actively misleading as a design goal. If a model-specific harness profile is worth 10–20 points on a tau2-bench subset, then a generic harness is leaving that performance on the table every day in production. Worse, the gap will likely grow as models develop more idiosyncratic strengths, such as Claude’s effort/thinking calibration, Gemini’s long-context behavior, and Nemotron’s reasoning toggle.

The Channel Fracture paper is a warning shot from the other direction. Multi-agent orchestration frameworks have silent failure modes (scheduled agents that never write to shared memory), and these only show up when the harness assumes a uniform execution model that the underlying agents don’t share. Generic abstractions hide real divergence.

Warning

If your evals run against one model and your production runs against another, your harness profile mismatch is probably a larger source of regression than any prompt change you’re tracking. Profile drift is the new prompt drift.
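A cheap guard against this kind of drift is to fingerprint the model-plus-profile combination and compare the eval fingerprint with the production one before deploying. The sketch below is one possible approach, assuming the profile can be serialized as plain JSON; the function names and the 12-character truncation are arbitrary choices made for illustration.

```python
import hashlib
import json


def profile_fingerprint(model: str, profile: dict) -> str:
    """Stable short hash of the model name plus every profile setting."""
    # sort_keys makes the hash independent of dict insertion order.
    payload = json.dumps({"model": model, "profile": profile}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def assert_profiles_match(eval_fp: str, prod_fp: str) -> None:
    """Refuse to proceed when evals and production ran different combinations."""
    if eval_fp != prod_fp:
        raise RuntimeError(
            f"profile drift: evals ran {eval_fp}, production runs {prod_fp}"
        )
```

Stamping the fingerprint onto every eval report and every production trace makes "which combination produced this number?" a lookup instead of an investigation.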

What to do differently

A few concrete shifts worth making now:

Treat the harness profile as a first-class artifact. Version it per model family. When you change models, expect to rewrite middleware, tool descriptions, and effort settings — not just the model string. Test the combination, not the model.
Decide where orchestration lives, explicitly. Either the model writes the plan (Claude Code dynamic workflows, Deep Agents style) or your framework does (LangGraph subgraphs). Hybrid systems where both try to plan are the ones that produce the most confusing traces. Harmonic’s experience suggests that when the model is strong enough, pushing structure into the model wins on iteration speed.
Move verification out of the loop. Batched LLM-as-judge, pre-execution policy gates, and gateway-level redaction are cheaper and more debuggable than mid-loop checks. The OCL and Harvey results both point to the same architecture: a fast inner loop and a separate, model-agnostic verification layer that doesn’t need to match the agent’s model.
Plan for model-authored harnesses. If you’re building tooling for agents that spawn subagents at runtime, your observability needs to handle a dynamic topology — the trace structure isn’t known until execution. SIR-style structured trace analysis and trace-as-primary-artifact tooling get more important, not less, as the graph stops being declarative.
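The "move verification out of the loop" item is the easiest of these to sketch. The Python below shows one possible shape for a batched, model-agnostic verifier: agent outputs are collected first and then judged in chunks, one judge call per chunk. The `Judge` callable, its contract (one boolean verdict per output), and the default batch size are assumptions for illustration and do not describe Harvey's system.

```python
from typing import Callable, Sequence

# Hypothetical judge contract: takes a batch of outputs, returns one verdict each.
Judge = Callable[[Sequence[str]], list[bool]]


def verify_batched(outputs: Sequence[str], judge: Judge, batch_size: int = 20) -> list[bool]:
    """Verify agent outputs outside the agent loop, one judge call per batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    verdicts: list[bool] = []
    for start in range(0, len(outputs), batch_size):
        batch = outputs[start:start + batch_size]
        result = judge(batch)
        if len(result) != len(batch):
            # A judge that drops or invents verdicts must not pass silently.
            raise ValueError(f"judge returned {len(result)} verdicts for {len(batch)} outputs")
        verdicts.extend(result)
    return verdicts
```

Because the verifier only sees strings and returns booleans, the judge behind it can be swapped without touching the agent's harness profile, which is the decoupling the list above argues for.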

The portable, model-agnostic agent harness was a useful fiction for early prototyping. In production, the harness is becoming a tightly coupled assembly of model, profile, and verification layer — with the agent itself increasingly responsible for authoring its own execution plan inside that assembly. The engineering work is no longer about designing the perfect graph. It’s about designing the smallest harness that lets a specific model plan well, and the strongest external layer that catches it when it doesn’t.