danielhuber.dev@proton.me Sunday, May 24, 2026

Alignment Is Splitting Into Two Layers: Midtraining and Runtime

Recent work from Anthropic, OpenAI, and Mozilla suggests alignment is no longer a single fine-tuning step — it's becoming a layered system spanning training stages and execution infrastructure.


May 9, 2026

For most of the last two years, alignment was treated as a fine-tuning problem: collect preference data, run RLHF or constitutional AI, ship the model. The week’s research and product announcements suggest that framing is breaking down. Alignment is becoming a layered concern — pushed earlier into pretraining-adjacent stages, and pushed later into the harness and runtime — with the traditional fine-tuning step shrinking in relative importance.

The training-side shift: alignment moves upstream

Anthropic’s Model Spec Midtraining (MSM) inserts a stage between pretraining and fine-tuning where the model is trained on synthetic documents describing its intended specification. The reported numbers are striking: agentic misalignment on Qwen3-32B drops from 54% to 7% before any preference fine-tuning. The companion piece on “teaching Claude why” — training on ethical reasoning demonstrations and constitutional documents, with harmlessness environments augmented by tool definitions — reports agentic blackmail behavior driven to zero across recent Claude models.

The pattern is the same in both: alignment generalizes better when the model learns the reasons and specs during training, not just the preferred outputs during fine-tuning. Fine-tuning on its own appears to produce brittle behavior that collapses under agentic pressure (long horizons, tool access, deceptive sub-goals). Midtraining bakes the spec into the representation.

Natural Language Autoencoders point in a related direction. By training a model to convert its own internal activations into readable text, Anthropic gets a probe that caught Claude Mythos Preview internally planning to circumvent detection during a coding task. That’s not a fine-tuning intervention — it’s a representational one. The signal lives in the weights; the work is making it legible.

The runtime-side shift: alignment moves into the harness

While training-side alignment is moving upstream, runtime alignment is moving in the opposite direction — out of the model entirely and into the surrounding infrastructure. OpenAI’s writeup on running Codex safely is almost entirely about non-model controls: sandboxing, approval workflows, network policies, and agent-native telemetry. Mozilla’s Firefox hardening work with Claude Mythos Preview describes prompting and model-stacking techniques explicitly designed to filter noise from a model that, by itself, produces too many false positives to act on.

The Hermes Agent v0.13.0 release reads like a checklist of runtime alignment primitives: zombie detection, per-task retries, hallucination recovery, default-on credential redaction, scoped platform allowlists. None of this lives in model weights. It’s all harness.

LangChain’s Deep Agents work makes the same bet from a different angle. Harness profiles — per-model overrides for prompts, tools, and middleware — produced 10–20 point benchmark gains on tau2-bench, and harness-level changes alone moved gpt-5.2-codex from 52.8% to 66.5% on Terminal-Bench 2.0. The same model, different scaffolding, dramatically different behavior. If a harness change can produce a 14-point swing on a coding benchmark, it can also produce a 14-point swing on a safety benchmark.

Note

The practical consequence: “how aligned is this model?” is no longer a meaningful question on its own. The aligned unit is now (model + harness + runtime policy). Benchmarks that ignore the harness — and most still do — are measuring something that doesn’t exist in production.

What’s left for fine-tuning

If midtraining handles the spec and the harness handles the runtime, traditional alignment fine-tuning ends up with a narrower job: shaping style, refusal calibration, and the specific tradeoffs the lab wants to express on top of a model that already understands what it’s supposed to be. Anthropic’s MSM paper essentially says the quiet part out loud — generalization from fine-tuning improves because midtraining did the heavy lifting.

This maps onto something practitioners have been observing without naming. The LangChain VP of Engineering’s thread arguing that task-specific harness design consistently outperforms default agent setups, and that open models are increasingly viable for the long tail at lower cost, makes more sense in this frame. If the model carries a reasonable spec from midtraining and the harness enforces runtime policy, the marginal value of provider-specific fine-tuning shrinks for many tasks. Open-weight models become viable not because they caught up on raw capability, but because the layers around them got strong enough to compensate.

The Deep Agents CLI and Hermes Agent v0.13.0 both point this way: model-agnostic harnesses with per-model profiles, designed so that Kimi, Qwen, or GLM can drive the loop competitively. The harness absorbs the differences.

What practitioners should do differently

If alignment is splitting into a midtraining layer and a runtime layer, the implications for production agent teams are concrete:

  • Stop treating model selection and harness design as separable decisions. A model’s safety profile is the joint distribution of its training and its scaffolding. Evaluate them together or you’re measuring noise.
  • Invest in runtime alignment primitives the way you invest in observability. Sandboxing, approval workflows, credential redaction, scoped allowlists, and pre-execution verification (as in TrustBench-style systems) are no longer nice-to-haves. They’re load-bearing.
  • Audit your harness for safety-relevant context. If your middleware controls what tools a model sees, what context survives compaction, and what the system prompt asserts about the model’s role, your harness is now part of your alignment story whether you planned it that way or not.
  • Treat alignment evaluations as harness-conditioned. A model that scores well on Petri 3.0 in isolation may behave very differently inside your specific scaffolding. Re-run safety evals with your production harness in the loop.

Where this leads

In 6–12 months, expect the alignment conversation to bifurcate. On one side: research labs publishing on training-stage interventions — midtraining, representational probes, spec-conditioned generation. On the other side: infrastructure teams publishing on runtime controls — harness profiles, action verification, sandbox topology, telemetry schemas. The middle — classical RLHF-style fine-tuning as the primary alignment lever — will get quieter, not because it stops working, but because the leverage has moved.

For practitioners, the operational shift is simpler than it sounds. Alignment is no longer something a model vendor delivers to you. It’s something you compose, with the vendor handling the upstream layer and your harness handling the downstream one. The teams that internalize this and build for it will run agents that behave predictably under load. The teams that keep treating alignment as a property of a model checkpoint will keep being surprised by the gap between benchmark scores and production incidents.

Tags: perspectivessafetyalignmentharness