When Evals Become Optimization Targets
Optimizing agent harnesses against a fixed eval suite triggers Goodhart's Law — the same dynamic that eroded search quality through SEO. How adversarial eval co-evolution can help.
A common practice in agent engineering is using your eval suite as a training signal to optimize the agent harness — measure what works, adjust the scaffolding, re-measure, repeat. It’s a clean loop, and it produces real improvements in the short term.
The problem is Goodhart’s Law: when a measure becomes a target, it ceases to be a good measure. The moment your evals become an optimization target, they start measuring “what scores well on these specific tests” rather than “good agent behavior.” This is a mechanism we’ve seen play out before at massive scale — it was called SEO, and understanding that parallel helps explain both the risk and the fix.
The Mechanism
The premise is sound: agent performance depends more on the harness than the model. Restructuring tool orchestration, context assembly, or retry logic moves the needle more than swapping models. So why not optimize the harness systematically? Define your eval suite, run configurations against it, keep the winners.
This is where Goodhart’s Law bites: once the eval becomes the target, it stops being a good measure. And the mechanism by which this happens in agent systems is specific and predictable.
An eval suite tests a finite set of behaviors across a finite set of scenarios. When you optimize a harness against that suite, you’re not optimizing for “good agent behavior” — you’re optimizing for “behavior that scores well on these specific tests.” The harness learns the eval’s blind spots. It finds configurations that satisfy the letter of each test case while drifting from the spirit. Not through malice — optimization against a fixed target always finds shortcuts.
The danger isn’t that your eval suite is bad. It’s that any fixed eval suite has a finite shelf life once it becomes an optimization target. The better your optimization loop, the faster your evals decay.
Consider a coding agent eval that measures correctness across a set of programming tasks. You optimize the harness — context window allocation, tool call ordering, retry logic, prompt templates — and after 50 iterations, pass rate climbs from 72% to 91%. Dashboard green. But the harness learned that your tasks are self-contained functions, that your correctness check doesn’t inspect intermediate steps, and that certain prompt patterns parse cleanly against your rubric. It optimized for the surface of your eval, not the capability your eval was trying to measure.
Deploy against real-world tasks — messier, more ambiguous, requiring multi-file reasoning your eval didn’t cover — and performance drops below the pre-optimization baseline. The harness overfit. Exquisitely tuned to the test distribution, fragile everywhere else.
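The simplest defense against this failure mode is a held-out suite the tuner never sees, with an alarm on the gap between the two. A minimal sketch, using hypothetical pass rates in the spirit of the 72% → 91% story above:

```python
# Sketch: flag eval decay by watching the optimization-suite pass rate
# diverge from a held-out suite the tuner never touches. Numbers and
# thresholds are illustrative.

def decay_alarm(history, gap_threshold=0.10):
    """history: list of (opt_rate, holdout_rate) per optimization round.
    Fires when the latest gap exceeds the threshold AND the holdout
    rate has dropped since round 0 -- the signature of overfitting."""
    _, first_holdout = history[0]
    last_opt, last_holdout = history[-1]
    gap = last_opt - last_holdout
    return gap > gap_threshold and last_holdout < first_holdout

# Dashboard green, capability declining:
rounds = [(0.72, 0.70), (0.81, 0.68), (0.87, 0.62), (0.91, 0.55)]
print(decay_alarm(rounds))  # True
```

The alarm is cheap because the holdout suite only needs to be run, never optimized against; the moment it leaks into the tuning loop, it decays like everything else.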
The SEO Parallel
This is not hypothetical. We watched this exact dynamic hollow out web search quality over twenty years.
Google’s ranking algorithm was an eval suite for web page quality. PageRank measured something real: pages that many other pages linked to were probably valuable. Website operators discovered they could optimize against this signal. They built link farms, keyword-stuffed content, and backlink networks. Google responded with more sophisticated signals. Operators reverse-engineered those too. Each cycle produced content that scored better on Google’s metrics while delivering less value to users.
The result was a web filled with content optimized for ranking rather than reading. Google’s dashboard — click-through rates, time-on-page, query satisfaction scores — kept looking reasonable even as the actual user experience degraded. The metrics became reflections of the optimization process, not measurements of quality.
| Stage | SEO Parallel | Agent Eval Parallel |
|---|---|---|
| 1. Measure | PageRank captures real quality signal | Eval suite captures real agent capability |
| 2. Optimize | Websites optimize for ranking factors | Harness optimized against eval suite |
| 3. Exploit | Link farms, keyword stuffing | Harness learns eval blind spots and shortcuts |
| 4. Respond | Google adds new ranking signals | Engineers add new eval cases |
| 5. Decay | Content quality degrades while metrics hold | Agent reliability degrades while pass rates hold |
| 6. Repeat | New signals get gamed too | New evals get optimized against too |
The critical insight is at stage 5: metrics hold while quality degrades. This is why the problem is insidious. Your eval pass rate is climbing. Your harness configurations are improving by every measure you track. But the measures themselves have been compromised by the optimization process. You’re navigating by a compass that your own movement is magnetizing.
Why “More Evals” Doesn’t Fix It
The instinctive response is to expand the eval suite. Add more test cases, cover more edge cases, test more dimensions. This helps temporarily — each new eval provides a genuine signal until it too becomes an optimization target. But it’s a linear defense against an exponential problem.
Every eval you add increases the surface area of your test suite. But the harness optimization process is searching a vast configuration space and can find shortcuts across the entire surface simultaneously. Adding 50 new eval cases after each optimization round doesn’t reset the game — it just makes each round marginally more expensive before the same dynamic reasserts itself.
There’s a deeper issue. Expanding an eval suite requires knowing what to test for. The most dangerous failure modes are the ones you didn’t anticipate — the scenarios your eval suite doesn’t cover precisely because you didn’t think of them. An optimization process that’s tuned against your known test cases has, by definition, never been tested against your unknown failure modes. You’re building a fortress on the side of the mountain where you can see the enemy, while the actual attack comes from the side you didn’t think to defend.
The “more evals” instinct treats evaluation as a coverage problem. The real problem is adversarial dynamics: your optimization loop is an opponent that adapts to whatever you measure. Coverage cannot outrun adaptation.
Adversarial Eval Co-Evolution
The fix isn’t better evals or more evals. It’s evals that evolve in response to optimization — an adversarial process where a separate system actively searches for inputs that cause the optimized harness to fail.
The structure looks like this:
┌─────────────────────────────────────────────────┐
│ Agent Harness │
│ Optimized against current eval suite │
└──────────────────┬──────────────────────────────┘
│ produces
▼
┌─────────────────────────────────────────────────┐
│ Eval Results + Traces │
│ Pass rates, failure patterns, execution logs │
└──────────┬───────────────────────┬──────────────┘
│ │
▼ ▼
┌──────────────────┐ ┌─────────────────────────┐
│ Harness Tuner │ │ Red Team Generator │
│ Adjusts config │ │ Generates new evals │
│ to improve │ │ designed to break the │
│ pass rates │ │ current harness │
└──────────┬───────┘ └──────────┬──────────────┘
│ │
▼ ▼
┌─────────────────────────────────────────────────┐
│ Updated Eval Suite │
│ Old evals + adversarially generated new ones │
└─────────────────────────────────────────────────┘
The red team generator is a separate agent — it can be a different LLM instance — that has access to three things: the current eval suite, the current harness configuration, and the execution traces from recent eval runs. Its job is to generate new test cases specifically designed to exploit patterns in how the harness was optimized. If the harness learned to shortcut on self-contained function tasks, the red team generates multi-file tasks. If the harness optimized its retry logic for a particular error pattern, the red team generates scenarios with novel error types.
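The generator described above can be sketched in a few lines. This is a minimal version, assuming a generic `llm(prompt) -> str` chat-completion callable and an illustrative `EvalCase` shape — neither is a fixed API:

```python
# Sketch of one red-team round. The prompt structure and EvalCase
# fields are assumptions, not a standard schema.
import json
from dataclasses import dataclass

@dataclass
class EvalCase:
    task: str
    checks: list  # assertions the harness output must satisfy

def red_team_round(llm, eval_suite, harness_config, traces, n_cases=10):
    """Ask a separate model to attack the current harness: given what
    it was tuned on and how it behaved, propose cases that exploit
    those patterns."""
    prompt = (
        "You are red-teaming an agent harness.\n"
        f"Current eval tasks:\n{json.dumps([c.task for c in eval_suite])}\n"
        f"Harness config:\n{json.dumps(harness_config)}\n"
        f"Recent execution traces (truncated):\n{traces[:4000]}\n"
        f"Generate {n_cases} new tasks that target patterns the harness "
        "appears to have specialized to. Return a JSON list of "
        '{"task": ..., "checks": [...]} objects.'
    )
    new_cases = json.loads(llm(prompt))
    return eval_suite + [EvalCase(c["task"], c["checks"]) for c in new_cases]
```

The key design choice is that the generator sees the traces, not just the scores — shortcuts show up in *how* the harness passed, and that is what the attack prompt needs.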
This creates a co-evolutionary dynamic rather than an optimization-against-fixed-target dynamic. The eval suite and the harness adapt to each other. Neither reaches a stable equilibrium because each improvement by one side creates pressure on the other. This is computationally more expensive than static eval optimization, but it produces systems that are genuinely robust rather than superficially well-scored.
The biological parallel is the Red Queen hypothesis: parasites and hosts co-evolve because neither can stop adapting without being overtaken by the other. A static immune system gets destroyed by evolving pathogens. A static eval suite gets destroyed by an evolving harness. The only stable strategy is continuous co-evolution.
Why This Gets Worse With Evolutionary Search
The adversarial dynamic becomes acute when harness optimization moves from manual tuning to automated search. The trajectory is already visible: define a genome of configurable harness parameters — memory type, context management strategy, tool access policy, compression triggers, retry logic, prompt templates — and let configurations compete through eval-based fitness. Evolutionary search over harness architectures, where configurations mutate, recombine, and get selected based on eval scores.
This is a powerful idea. It’s also the fastest way to trigger eval autoimmunity. Each generation of harness configurations is selected for eval performance, which is selection pressure for finding eval exploits. Manual tuning might take weeks to overfit an eval suite. Evolutionary search can do it in hours. The stronger the optimization, the faster the eval degrades — and automated search is orders of magnitude stronger than human iteration.
This makes adversarial eval co-evolution non-optional. If you’re running any form of automated harness optimization — evolutionary, Bayesian, grid search, anything — a static eval suite isn’t just a weak foundation. It’s an active liability. The optimization will find every shortcut your evals leave open, and it will find them fast.
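The loop that makes this failure mode fast is easy to write down. A minimal sketch with an illustrative genome and a stand-in `score` function (in practice, the eval pass rate):

```python
# Sketch of evolutionary harness search. Genome fields and values are
# illustrative; every generation selects for eval score, which is
# exactly the selection pressure that discovers eval exploits.
import random

GENOME_SPACE = {
    "memory": ["none", "buffer", "vector"],
    "retries": [0, 1, 2, 3],
    "compression": ["never", "at_80pct", "always"],
}

def mutate(genome, rng):
    child = dict(genome)
    key = rng.choice(list(GENOME_SPACE))
    child[key] = rng.choice(GENOME_SPACE[key])
    return child

def evolve(score, generations=20, pop_size=8, seed=0):
    """score(genome) -> fitness (e.g. eval pass rate). Keeps the top
    half each generation and refills with mutated elites."""
    rng = random.Random(seed)
    pop = [{k: rng.choice(v) for k, v in GENOME_SPACE.items()}
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        elites = pop[: pop_size // 2]
        pop = elites + [mutate(rng.choice(elites), rng) for _ in elites]
    return max(pop, key=score)
```

Nothing in this loop knows what a “shortcut” is — it only knows the score. That is why the eval suite passed in as `score` must itself be moving.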
The Process Dimension
Adversarial eval generation attacks the problem from one direction: making evals harder to game by evolving them. Process constraints attack it from a complementary direction: making the optimization surface itself harder to shortcut.
Current agent evals overwhelmingly measure outcomes. Did the agent produce the correct answer? Did the code pass the test suite? But an agent that produces the correct answer by accessing systems it shouldn’t have, making 47 redundant API calls, or taking a path that would be catastrophic if the input were slightly different — that agent passes the outcome eval while being genuinely dangerous.
Clinical trials don’t just measure whether a drug cures the disease — they measure adverse events along the way. Agent evals need the same structure: outcome metrics paired with process constraints. Did the agent stay within its authorized tool set? Did it maintain reasonable resource consumption? Did its intermediate steps follow defensible logic?
Process constraints and adversarial eval generation work together. Process evals constrain the path, not just the destination — a harness that shortcuts on outcomes gets caught by process constraints. Adversarial generation ensures neither surface stays static long enough to be gamed. Together, they force the optimization to satisfy two co-evolving objective surfaces, which raises the cost of gaming dramatically.
Pair every outcome eval with at least one process constraint. “Did the agent get the right answer?” is necessary but insufficient. “Did the agent get the right answer without accessing unauthorized tools, exceeding token budgets, or taking steps that would be dangerous on adjacent inputs?” is a much harder target to game.
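Concretely, this means an eval case returns pass/fail on the *conjunction* of outcome and process checks. A minimal sketch — the trace fields (`tools_used`, `tokens`, `steps`) are an assumed shape for whatever your harness logs, and the thresholds are illustrative:

```python
# Sketch: pair an outcome check with process constraints so a correct
# answer via a dangerous path still fails the eval.

AUTHORIZED_TOOLS = {"read_file", "write_file", "run_tests"}
TOKEN_BUDGET = 50_000

def passes(trace, outcome_ok):
    """Pass only if the answer is right AND the path was defensible."""
    violations = []
    if not outcome_ok:
        violations.append("wrong answer")
    if not set(trace["tools_used"]) <= AUTHORIZED_TOOLS:
        violations.append("unauthorized tool")
    if trace["tokens"] > TOKEN_BUDGET:
        violations.append("token budget exceeded")
    if len(trace["steps"]) > 3 * trace["expected_steps"]:
        violations.append("excessive redundant steps")
    return (len(violations) == 0, violations)

# Right answer, wrong path: an outcome-only eval would score this green.
trace = {"tools_used": ["read_file", "shell_exec"], "tokens": 12_000,
         "steps": list(range(47)), "expected_steps": 5}
print(passes(trace, outcome_ok=True))
```

Returning the violation list, not just the boolean, matters: those violations are exactly the failure patterns the red-team generator should be fed.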
What This Means for Practitioners
If you’re building eval suites for production agents, three principles should guide your investment.
Treat eval suites as depreciating assets. Every eval has a shelf life that shortens the more aggressively you optimize against it. Budget for continuous eval regeneration, not one-time eval creation. If your eval suite hasn’t changed in a month but your harness has been optimized three times, your metrics are probably lying to you.
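That staleness heuristic is worth automating. A minimal sketch, assuming your pipeline tracks when the suite last changed and how many optimization runs have happened since (both hypothetical counters):

```python
# Sketch: alert when the eval suite is old AND the harness has been
# tuned against it repeatedly -- the "metrics are probably lying"
# condition. Thresholds mirror the rule of thumb above.
from datetime import datetime, timedelta

def evals_stale(last_eval_change, optimizations_since, now=None,
                max_age=timedelta(days=30), max_optimizations=2):
    now = now or datetime.now()
    return (now - last_eval_change > max_age
            and optimizations_since > max_optimizations)
```

Wiring this into CI as a failing check is one way to make eval depreciation visible instead of silent.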
Build adversarial eval generation into your pipeline. This doesn’t require a sophisticated system on day one. Start with a simple version: after each optimization cycle, have a separate LLM review the harness changes and the eval results, then generate 10 new test cases designed to stress the specific configurations that changed. Even this minimal approach disrupts the optimization-against-fixed-target dynamic.
Measure process, not just outcomes. Add resource consumption, tool authorization, intermediate step coherence, and behavioral consistency to your eval dimensions. These process metrics are harder to shortcut, and they catch failure modes that outcome metrics miss entirely — the silent, confident, wrong-but-scored-as-right failures that erode trust in production.
Know where static evals are fine. Not every agent system triggers this dynamic. Narrow-scope agents with fixed task definitions and stable toolsets — a data extraction pipeline, a form-filling bot — can run against static evals indefinitely because the optimization surface is constrained enough that there aren’t meaningful shortcuts to find. The autoimmune dynamic emerges when the agent has broad autonomy, diverse tools, and open-ended tasks. The wider the agent’s capability surface, the more urgently you need evolving evals.
The teams that will ship reliable agents aren’t the ones with the largest eval suites. They’re the ones whose evals evolve as fast as their systems do. Static evaluation in a dynamic system isn’t rigor — it’s theater.