danielhuber.dev@proton.me Saturday, April 4, 2026

MASEval: Why Your Agent Benchmark Is Missing Half the Picture

How to evaluate entire agentic systems—framework, model, and orchestration together—rather than treating model choice as the only variable that matters.


Most agent benchmarks are secretly model benchmarks in disguise. They fix the scaffolding, vary the model, and declare a winner—but the scaffolding itself carries enormous performance weight. When you swap LangGraph for AutoGen or change how tool results are fed back into context, task success rates can shift by as much as swapping between frontier models. Treating the evaluation unit as “the model” obscures that reality.

The Unit-of-Analysis Problem

When you deploy an agent in production, users don’t interact with a model—they interact with a system. That system includes a framework that manages the agent loop, a prompt template that shapes how the model reasons, a tool-calling convention that serializes and deserializes structured data, and retry or error-handling logic that determines what happens when something goes wrong. All of these components affect the observable outcome: did the agent complete the task correctly?

Conventional benchmarks sidestep this complexity by controlling every variable except the model. That’s experimentally clean, but it’s evaluating something you’ll never actually ship. The moment you change frameworks, upgrade your orchestration library, or refactor how you inject tool results, your benchmark numbers are no longer representative of your system’s real behavior.

Framework Choice Is a First-Class Variable

Consider what differs across popular agent frameworks even when the underlying model and task are identical:

  • Prompt formatting: How tool schemas are serialized into the system prompt, whether examples are included, and how assistant turns are structured.
  • Loop management: When the framework decides to stop, how it handles max-iteration limits, and whether it retries on parse errors.
  • Tool result injection: Whether tool outputs are injected as user messages, function results, or embedded in the next assistant turn.
  • State handling: How intermediate reasoning is preserved, truncated, or summarized across long trajectories.

Each of these choices interacts with model-specific behaviors. A model that was trained to expect OpenAI-style function-call formatting may underperform when a framework presents tool schemas as raw JSON in a system prompt. A framework that aggressively truncates context may hurt a model that relies on extended chain-of-thought but help a model that struggles with long contexts.
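To make the formatting difference concrete, here is a small sketch. The tool name, schema fields, and helper functions are illustrative assumptions, not taken from any particular framework; the point is that the same schema can reach the model as two very different artifacts:

```python
import json

# Hypothetical tool schema; the name and fields are illustrative.
tool = {
    "name": "search_docs",
    "description": "Search internal documentation.",
    "parameters": {"query": {"type": "string"}},
}

def as_function_call_spec(t: dict) -> dict:
    """Structured spec passed out-of-band, OpenAI function-call style."""
    return {"type": "function", "function": t}

def as_prompt_text(t: dict) -> str:
    """The same schema dumped as raw JSON into the system prompt."""
    return (
        "You can call this tool by replying with JSON:\n"
        + json.dumps(t, indent=2)
    )
```

A model fine-tuned on the first convention sees familiar structure; the second makes the schema just another span of prompt text. That mismatch is exactly the kind of thing that surfaces as a framework effect rather than a model effect.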

Same Task, Different Systems
─────────────────────────────────────────────────────

  Task Input

      ├──► [ Model A + Framework X ] ──► Result: ✓

      ├──► [ Model A + Framework Y ] ──► Result: ✗

      ├──► [ Model B + Framework X ] ──► Result: ✗

      └──► [ Model B + Framework Y ] ──► Result: ✓

  Framework effect ≈ Model effect in magnitude
  No single axis explains performance alone
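The diagram's claim can be checked with a toy calculation. The success rates below are invented for illustration, not measurements from any benchmark:

```python
# Hypothetical success rates for the 2x2 (model, framework) matrix above.
rates = {
    ("A", "X"): 0.80, ("A", "Y"): 0.45,
    ("B", "X"): 0.40, ("B", "Y"): 0.75,
}

def model_effect() -> float:
    """Average gap from swapping the model while holding the framework fixed."""
    return (abs(rates[("A", "X")] - rates[("B", "X")])
            + abs(rates[("A", "Y")] - rates[("B", "Y")])) / 2

def framework_effect() -> float:
    """Average gap from swapping the framework while holding the model fixed."""
    return (abs(rates[("A", "X")] - rates[("A", "Y")])
            + abs(rates[("B", "X")] - rates[("B", "Y")])) / 2
```

With these invented numbers, both effects come out at 0.35, which is the point: attributing all the variance to the model axis misreads half of it.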

Designing a System-Level Evaluation

A framework-agnostic evaluation library needs to solve a tricky interface problem: it must measure outcomes without coupling to any particular framework’s internals. The practical approach is to define evaluation at the boundary of the system—what goes in (task specification, initial state) and what comes out (final answer, action trace, tool call sequence)—and leave everything in between to the system under test.

This means your harness should:

  1. Define tasks as input/output contracts, not as framework-specific configurations. A task is a natural-language goal plus the ground truth or success criteria.
  2. Capture the full action trace, not just the final answer. Many agent failures are invisible if you only check the last output—the agent may have hallucinated a tool call, ignored a tool result, or looped unnecessarily.
  3. Run the same task across the full (model × framework) matrix in a single evaluation pass so you can attribute variance to the right axis.
  4. Aggregate metrics at the system level. Report success rate, tool call efficiency, and error patterns per (model, framework) pair, not just per model.

# Minimal system-level evaluation interface.
# AgentSystem, Task, EvalResult, and ComparisonReport are assumed to be
# defined elsewhere; AgentSystem exposes .name and .run(input) -> trajectory.
class AgentSystemEvaluator:
    def run_task(self, system: AgentSystem, task: Task) -> EvalResult:
        # Run the system end to end; score only what crosses the boundary.
        trajectory = system.run(task.input)
        return EvalResult(
            success=task.check(trajectory.final_answer),
            steps=len(trajectory.actions),
            tool_calls=trajectory.tool_calls,
            errors=trajectory.errors,
        )

    def compare_systems(
        self,
        systems: list[AgentSystem],
        tasks: list[Task],
    ) -> ComparisonReport:
        # Same tasks for every system, so variance attributes to the right axis.
        results = {
            s.name: [self.run_task(s, t) for t in tasks]
            for s in systems
        }
        return ComparisonReport(results)
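Building the full matrix for such an evaluator is a cross product over models and frameworks. The sketch below is self-contained; `make_system` is a hypothetical factory standing in for real framework adapters:

```python
from itertools import product

models = ["model-a", "model-b"]
frameworks = ["framework-x", "framework-y"]

def make_system(model: str, framework: str):
    """Hypothetical factory: wraps a model in a framework adapter that
    exposes the .name / .run(input) surface the evaluator expects."""
    class _StubSystem:
        name = f"{model} + {framework}"
        def run(self, task_input):
            raise NotImplementedError  # real adapter logic goes here
    return _StubSystem()

# One system per (model, framework) cell, all run over the same task
# list in a single evaluation pass.
systems = [make_system(m, f) for m, f in product(models, frameworks)]
```

Keeping system construction in one factory also makes it obvious when a comparison is confounded, since every cell is built the same way.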

Tip

When building your evaluation matrix, include at least one “easy” benchmark where you expect all systems to score high. Divergence on easy tasks is a strong signal of framework-level bugs or prompt formatting mismatches, not model capability gaps.
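A minimal check for that signal might look like the following sketch; the function name and the 0.9 threshold are assumptions, to be tuned to your suite:

```python
def flag_easy_task_divergence(
    results: dict[str, list[bool]],
    threshold: float = 0.9,
) -> list[str]:
    """Return system names that underperform on tasks every system should pass.

    `results` maps system name -> per-task pass/fail on the easy suite.
    A system falling below `threshold` while others clear it points to
    scaffolding bugs (prompt formatting, parse errors), not model capability.
    """
    rates = {name: sum(r) / len(r) for name, r in results.items()}
    if max(rates.values()) < threshold:
        return []  # everyone struggles: the suite may not be easy after all
    return [name for name, rate in rates.items() if rate < threshold]
```

The early return matters: if no system clears the threshold, the "easy" suite is miscalibrated and flagging individual systems would be misleading.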

What This Means for Production Teams

The practical implication is that your evaluation pipeline should be a first-class part of your framework selection process, not an afterthought. Before committing to a framework:

  • Run your task suite against at least two frameworks with your target model to establish a baseline delta.
  • Instrument the action trace to understand why one system outperforms another—is it fewer wasted tool calls, better error recovery, or more compact context?
  • Lock your framework version in CI the same way you lock your model version. A patch release that changes prompt formatting can silently degrade performance.
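The first two bullets combine into a single check: given pass/fail results for the same model on two frameworks over an identical, ordered task suite, compute the rate delta and the tasks that flipped. This is a sketch with illustrative names:

```python
def framework_delta(
    results_x: list[bool],
    results_y: list[bool],
    task_ids: list[str],
) -> tuple[float, list[str]]:
    """Success-rate gap between frameworks X and Y for one model, plus
    the ids of tasks whose outcome flipped between the two systems."""
    assert len(results_x) == len(results_y) == len(task_ids)
    rate = lambda r: sum(r) / len(r)
    flipped = [t for t, x, y in zip(task_ids, results_x, results_y) if x != y]
    return rate(results_x) - rate(results_y), flipped
```

The flipped-task list is where instrumented action traces pay off: those are the exact trajectories to diff when asking why one system outperforms the other.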

For teams running multi-agent systems, the combinatorial complexity grows quickly. Frameworks often handle inter-agent communication, message routing, and shared memory differently, so a framework upgrade in one part of the pipeline can affect the behavior of agents that weren’t changed at all.

Building Toward Reproducible Agent Benchmarks

The broader principle here is that reproducibility in agent evaluation requires fully specifying the system, not just the model. When you publish internal benchmark results or compare against public leaderboards, the numbers are only meaningful if the framework, version, and configuration are part of the specification.

This mirrors a maturation that happened in traditional software benchmarking: it’s not enough to benchmark an algorithm; you benchmark the algorithm running in a specific runtime, on a specific hardware profile, under a specific load pattern. Agent systems are no different. The scaffolding isn’t a neutral carrier—it’s an active participant in the outcome, and your evaluations should treat it that way.

Tags: research, evaluation, benchmarking, multi-agent, frameworks, testing

This article is an AI-generated summary. Read the original paper: MASEval: Extending Multi-Agent Evaluation from Models to Systems.