
Evaluation Patterns for Deep Agents: Single-Step, Full-Turn, and Multi-Turn Testing

How to design layered evaluation strategies for long-horizon AI agents using single-step interrupts, full-turn assertions, and multi-turn simulations.


March 3, 2026

Evaluating a simple question-answering pipeline is straightforward: feed inputs, score outputs, repeat. Evaluating a deep agent—one that plans across many steps, mutates external state, and interacts with users over multiple turns—requires a fundamentally different approach. The uniform dataset-plus-evaluator pattern breaks down when every test case has its own success criteria, involves intermediate state you care about, and may span dozens of tool calls.

Why Standard Eval Pipelines Fall Short

Traditional LLM evaluation assumes every example is structurally identical: the same application logic processes each input, and the same scoring function judges each output. Deep agents violate this assumption in three ways.

First, what counts as “correct” is highly data-point-specific. A test case checking that a calendar agent remembered a scheduling preference needs to verify a file mutation, not just a string in the final message. A test case checking that a coding agent chose the right refactoring tool needs to inspect the tool call arguments, not the prose explanation. Writing a single generic evaluator that handles both is either too coarse to be useful or so complex it becomes unmaintainable.

Second, the surface area of what you can assert expands dramatically. For any given agent run you can check the trajectory (which tools were called, in what order, with what arguments), the final response, and arbitrary side effects like file contents, database writes, or memory store updates. Choosing which surface to assert against—and how—is itself a design decision.
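As a concrete sketch of asserting against the trajectory surface, a small helper can flatten an agent run into an ordered list of tool calls. Here `tool_trajectory` is a hypothetical helper, and the `SimpleNamespace` objects stand in for LangChain-style AI messages that carry a `tool_calls` attribute:

```python
from types import SimpleNamespace

def tool_trajectory(messages):
    """Collect (tool name, args) pairs in the order the agent emitted them."""
    calls = []
    for msg in messages:
        for tc in getattr(msg, "tool_calls", None) or []:
            calls.append((tc["name"], tc["args"]))
    return calls

# Stand-ins for a real message history from an agent run
msgs = [
    SimpleNamespace(tool_calls=[{"name": "search_calendar", "args": {"query": "tomorrow"}}]),
    SimpleNamespace(content="Here are the open slots."),  # plain message, no tool calls
    SimpleNamespace(tool_calls=[{"name": "create_event", "args": {"start": "14:00"}}]),
]

# Order-of-calls assertion against the trajectory surface
assert [name for name, _ in tool_trajectory(msgs)] == ["search_calendar", "create_event"]
```

The same helper supports argument-level assertions (e.g., checking that a query string contains an expected term) without coupling the test to message formatting.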

Third, the cost of running a full agent for every test case adds up quickly. A deep agent might invoke ten or twenty tools per turn. Running the entire sequence just to check one early decision wastes tokens and slows feedback loops.

The Three Granularities of Agent Evaluation

A practical solution is to treat agent evaluation as a hierarchy of three granularities, each suited to a different class of question.

Single-step evaluation constrains the agent loop to produce exactly one action and then halt. You feed the agent a pre-constructed state—conversation history, memory contents, tool results already in context—and ask: given this exact situation, what does it decide to do next? This is the most efficient test format because it requires no tool execution and catches regressions at individual decision points. If your agent uses a graph-based runtime like LangGraph, you can compile the graph with an interrupt before the tools node to intercept the chosen action before it executes.

def test_selects_correct_search_tool(agent_graph, checkpointer):
    state = build_state(
        messages=[HumanMessage(content="Find a 30-minute slot tomorrow afternoon")],
        memories="User prefers afternoons, never before 9 AM"
    )
    # agent_graph is compiled with interrupt_before=["tools"], so the run
    # pauses once the model has chosen its next action, before it executes
    config = {"configurable": {"thread_id": "single-step-test"}}
    result = agent_graph.invoke(state, config=config)
    tool_call = extract_next_tool_call(result)
    assert tool_call["name"] == "search_calendar"
    assert "tomorrow" in tool_call["args"]["query"]

Full-turn evaluation runs the agent to completion on a single input, allowing the full multi-step tool-calling loop to resolve. This is the right granularity for assertions about end state: did the memories file end up containing the right content? Did the agent’s final message confirm the action it took? You can combine deterministic assertions (regex, exact match, JSON schema checks) with LLM-as-judge scoring for dimensions that require semantic understanding.

Multi-turn evaluation runs the agent through a simulated conversation—multiple user inputs and agent responses in sequence. This is the highest-fidelity format but also the most expensive and the hardest to keep stable. The key engineering challenge is keeping the simulation “on rails”: the simulated user must respond in ways that stay within the scenario you’re testing, otherwise you’re measuring the behavior of a stochastic user simulator rather than your agent. One reliable approach is to precompute the user turns and replay them deterministically rather than generating them live.
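The replay approach can be sketched as follows. Everything here except the pattern itself is a stand-in: `FakeSession` plays the role of your real agent harness, and its canned `ack:` reply marks where the actual agent call would go:

```python
# Precomputed user turns: only the agent under test is stochastic,
# never the "user" side of the conversation.
SCRIPTED_TURNS = [
    "Schedule a 1:1 with Dana next week",
    "Actually, make it 45 minutes instead of 30",
    "Great, confirm it",
]

class FakeSession:
    """Stand-in for a persistent agent conversation thread."""
    def __init__(self):
        self.history = []

    def send(self, user_turn):
        self.history.append(("user", user_turn))
        reply = f"ack: {user_turn}"  # the real agent invocation goes here
        self.history.append(("agent", reply))
        return reply

def run_scenario(session, turns):
    # Replay each scripted turn in order; collect the agent's replies
    return [session.send(t) for t in turns]

session = FakeSession()
replies = run_scenario(session, SCRIPTED_TURNS)

# Assertions target end state and turn count, not intermediate phrasing
assert len(replies) == len(SCRIPTED_TURNS)
assert len(session.history) == 2 * len(SCRIPTED_TURNS)
```

Because the user side is fixed, a failure in this test implicates the agent, not the simulator—which is exactly the attribution you want from a multi-turn case.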

Tip

Aim for roughly half your test suite to be single-step cases. They run fast, are easy to debug, and catch the majority of regressions. Reserve full-turn and multi-turn cases for scenarios where the outcome genuinely depends on the complete execution sequence.

Writing Bespoke Test Logic Per Case

Because success criteria vary per test case, each test function becomes a small, focused program rather than a parameterized call to a shared evaluator. A test for “agent remembers a user preference” might look like this:

def test_remember_preference():
    response = run_agent("Never schedule meetings before 9 AM")
    tool_calls = extract_tool_calls(response)

    # Deterministic assertion: correct tool called on correct file
    assert any(
        tc["name"] == "edit_file" and tc["args"]["path"] == "memories.md"
        for tc in tool_calls
    ), "Agent did not update memories.md"

    # Read back the mutated file and check its content
    memory_content = read_agent_file("memories.md")
    assert "9" in memory_content or "nine" in memory_content.lower()

    # LLM-as-judge for semantic quality of user-facing confirmation
    score = llm_judge(
        criterion="Did the agent clearly confirm the preference was saved?",
        response=response.final_message
    )
    assert score >= 0.8

This structure—deterministic assertions first, then LLM-as-judge for semantic dimensions—keeps failures cheap and interpretable: structural regressions short-circuit before the expensive judge call ever runs, and the judge is reserved for responses that are already structurally correct.

Note

Layer your assertion strategy: use deterministic checks (tool name, file path, argument values) for structural correctness and LLM-as-judge only for dimensions where exact matching is insufficient, like tone, completeness, or semantic accuracy. This keeps costs low and failure modes interpretable.
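A minimal sketch of the `llm_judge` helper used above might look like this. The `call_model` parameter is an assumption—a pluggable callable standing in for your provider's chat-completion API—and the prompt wording is illustrative:

```python
import json

# Illustrative grading prompt; tune wording and scale to your rubric
JUDGE_PROMPT = """You are grading an AI agent's reply.
Criterion: {criterion}
Reply: {response}
Return JSON: {{"score": <float between 0 and 1>, "reason": "<one sentence>"}}"""

def llm_judge(criterion, response, call_model):
    """Score `response` against `criterion` using a grader model.

    `call_model` is assumed to take a prompt string and return the
    model's raw text completion.
    """
    raw = call_model(JUDGE_PROMPT.format(criterion=criterion, response=response))
    verdict = json.loads(raw)
    return verdict["score"]

# In a test, a fake model makes the judge itself deterministic to verify:
fake_model = lambda prompt: '{"score": 0.9, "reason": "clear confirmation"}'
score = llm_judge("Did the agent confirm the save?", "Saved your preference.", fake_model)
assert score == 0.9
```

Injecting the model call also lets you unit-test the judge's parsing and thresholds without spending tokens.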

Environment Setup and Reproducibility

Deep agents mutate state—files, databases, memory stores, external services. If tests share state across runs, earlier tests corrupt the environment for later ones, producing flaky results that are nearly impossible to debug.

Every test case must start from a clean, reproducible environment. In practice this means:

  • Fixture-based setup and teardown: provision a fresh memory store, temporary directory, or sandbox database before each test and destroy it after.
  • Snapshot-based environments: for agents that operate on codebases or file trees, check out a known commit or restore from a snapshot before each run.
  • Hermetic tool mocks for external APIs: calendar APIs, email services, and web search should be mocked or sandboxed so that external state doesn’t bleed into your evaluation.
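The first and third bullets can be sketched as plain setup/teardown functions—pytest users would wrap the same logic in a yield fixture. `MockCalendarAPI` and the file names here are illustrative:

```python
import shutil
import tempfile
from pathlib import Path

class MockCalendarAPI:
    """Hermetic stand-in for the real calendar service."""
    def __init__(self):
        self.events = []

def setup_env():
    # Fresh sandbox directory and empty memory file for each test
    sandbox = Path(tempfile.mkdtemp(prefix="agent-test-"))
    (sandbox / "memories.md").write_text("")
    return {"sandbox": sandbox, "calendar": MockCalendarAPI()}

def teardown_env(env):
    # Destroy everything the test touched so no state leaks forward
    shutil.rmtree(env["sandbox"])

env = setup_env()
assert (env["sandbox"] / "memories.md").read_text() == ""
teardown_env(env)
assert not env["sandbox"].exists()
```

Because each environment lives in its own temporary directory with its own mock services, two tests can never observe each other's state—the property that makes parallel execution safe.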
Test Runner

    ├─ setup_fixture() ──► Fresh Environment
    │                        ├── memory.md (empty)
    │                        ├── calendar_api (mock)
    │                        └── file_sandbox (temp dir)

    ├─ run_agent(input) ──► Agent Execution
    │                        ├── step 1: tool call
    │                        ├── step 2: tool call
    │                        └── step N: final response

    ├─ assert_*(response, env_state)

    └─ teardown_fixture() ──► Destroy Environment

Environment isolation also makes it safe to run test cases in parallel, which is important when a full-turn eval suite might otherwise take tens of minutes.

Putting the Layers Together

A mature deep-agent eval suite is a deliberate mix of all three granularities. Single-step tests form the fast-feedback inner loop, catching decision-point regressions in seconds during development. Full-turn tests run in CI to verify end-state correctness. Multi-turn tests run less frequently—perhaps nightly or before a release—to validate that the agent holds up across realistic conversation flows.

The shift in mindset from traditional LLM evaluation is significant: instead of one dataset and one evaluator, you are writing a test suite in the same sense that a software engineer writes unit, integration, and end-to-end tests. The tooling overlaps too—pytest fixtures, assertion libraries, and coverage tracking all apply directly. Treating agent evaluation as software testing, rather than as a data-science scoring exercise, gives you the iteration speed and debuggability that production-grade agents require.

Tags: evaluation, deep agents, testing, LLM-as-judge, trajectory, multi-turn

This article is an AI-generated summary. Read the original paper: Evaluating Deep Agents: Our Learnings.