Cost-Driven Agent Architecture
How token economics — compound costs from multi-step reasoning, tool loops, and retry cascades — shape architectural decisions as agents move to production.
As organizations move agents from prototypes to production, token costs compound in ways that single-shot LLM pricing doesn’t predict. An agent that costs $0.15 per invocation in testing can easily cost $2-5 in production — multi-step reasoning chains, tool call loops, context window accumulation, and retry cascades all multiply the bill.
This cost pressure is already shaping architectural decisions. Model routing, context budgeting, speculative execution, and tiered reasoning are emerging not as optimizations but as structural requirements for running agents at scale.
The Compounding Cost Problem
The cost structure of agent systems is fundamentally different from single-shot LLM calls. In a traditional API call, cost scales linearly with input and output tokens. In an agent system, costs compound through multiple mechanisms.
Reasoning chains: A ReAct-style agent might take 5–15 steps to complete a task. Each step includes the full conversation history, growing the context window with every iteration. By step 10, you’re paying for the cumulative context of all previous steps.
Tool call overhead: Every tool call adds tokens — the tool definition, the call parameters, the response, and the model’s interpretation of that response. An agent with 20 available tools pays a “tool tax” on every single inference call, whether or not it uses those tools.
Retry and error recovery: Production agents hit errors, timeouts, and ambiguous results. Each retry means re-processing the full context. Agents designed for reliability often include explicit retry logic that can double or triple the token budget for a single task.
Step 1:  [System + Tools + Query]           → ~4K tokens
Step 2:  [System + Tools + Query + Step 1]  → ~6K tokens
Step 3:  [... + Step 2]                     → ~9K tokens
...
Step 10: [... + Steps 1-9]                  → ~35K tokens

Cumulative cost: Not 10× Step 1, but ~25× Step 1
The arithmetic is straightforward but the implications are severe. A 10-step agent invocation doesn’t cost 10× a single call — it costs roughly 25×, because each step re-processes all prior context. Add tool definitions, error handling, and the occasional retry loop, and production agents routinely consume 50–100× the tokens of the naive estimate.
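This growth is easy to model in a few lines of Python. The base prompt size and per-step growth below are illustrative values chosen to roughly match the numbers above:

```python
def cumulative_tokens(base=4000, added_per_step=1400, steps=10):
    """Total tokens processed across an agent run, where every step
    re-processes the full context accumulated so far."""
    total, context = 0, base
    for _ in range(steps):
        total += context            # each step pays for the whole window
        context += added_per_step   # the step's output joins the context
    return total

ratio = cumulative_tokens(steps=10) / cumulative_tokens(steps=1)
print(ratio)  # 25.75: roughly 26x a single step, not 10x
```

The exact multiple depends on how much each step adds, but any per-step growth at all pushes the total well past the linear estimate.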
Larger context windows don’t reduce costs — they enable larger costs. A 200K-token context window means an agent can accumulate 200K tokens of context, and under current architectures, it often will. Bigger windows without budget discipline just mean bigger bills.
The Model Routing Imperative
The most impactful architectural shift driven by cost pressure will be model routing: dynamically selecting which model handles each step of an agent’s execution based on task complexity.
Consider a typical coding agent workflow:
1. Parse the user’s request (simple comprehension)
2. Search the codebase for relevant files (tool orchestration)
3. Understand the code’s architecture (deep reasoning)
4. Generate the fix (high-capability generation)
5. Write a commit message (simple generation)
Steps 1, 2, and 5 don’t require a frontier model. A smaller, faster, cheaper model handles them equally well. Step 3 might need a mid-tier model. Only step 4 genuinely benefits from the most capable model available.
| Task Type | Model Tier | Relative Cost | Examples |
|---|---|---|---|
| Classification and routing | Small (Haiku-class) | 1× | Intent detection, tool selection, simple parsing |
| Synthesis and summarization | Mid (Sonnet-class) | 5–10× | Code comprehension, context summarization, planning |
| Complex reasoning and generation | Large (Opus-class) | 20–50× | Novel code generation, complex debugging, architecture decisions |
Early adopters of model routing report 60–80% cost reductions on agent workloads with no measurable quality degradation. The key insight is that most agent steps are not the hard step. The architecture should reflect that reality.
The implementation challenge is building the router itself. A simple heuristic router (route by step type) captures most of the value. A learned router (trained on historical step-outcome data) captures more, but introduces its own complexity and failure modes. For most teams, starting with heuristic routing and iterating toward learned routing is the pragmatic path.
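A first-pass heuristic router can be little more than a lookup from step type to model tier. The step types and tier names below are illustrative, not tied to any particular framework:

```python
# Heuristic router: map each agent step type to a model tier.
# Step types and tier names are illustrative placeholders.
ROUTING_TABLE = {
    "parse_request":  "small",
    "tool_selection": "small",
    "summarize":      "mid",
    "plan":           "mid",
    "generate_code":  "large",
    "debug":          "large",
}

def route(step_type: str, default: str = "large") -> str:
    """Pick a model tier for a step; unknown step types escalate."""
    return ROUTING_TABLE.get(step_type, default)

print(route("parse_request"))  # small
print(route("generate_code"))  # large
print(route("unknown_step"))   # large (fail safe: escalate when unsure)
```

Defaulting unknown step types to the largest tier keeps the failure mode expensive in dollars rather than expensive in quality, which is usually the right trade while the routing table is immature.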
Context Budgeting as a First-Class Concern
Today, most agent frameworks treat context as unlimited — or at least, as someone else’s problem. Tools dump their full output into the context. Previous conversation turns accumulate without pruning. System prompts grow as capabilities are added.
In cost-aware agent architectures, context becomes a managed resource with an explicit budget. This requires three shifts in how agents handle information.
Aggressive summarization: Instead of carrying full conversation history, agents summarize completed sub-tasks and discard the raw steps. A 10-step tool-use chain that consumed 15K tokens becomes a 200-token summary of what was found and decided.
Selective tool output: Tool responses are filtered or truncated before entering the context. A file search that returns 50 results gets trimmed to the 5 most relevant. A database query result gets summarized rather than included verbatim. The agent runtime, not the model, makes this decision.
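As a sketch, a runtime-side filter might rank raw search results and keep only the top few before they enter the window. The term-overlap scoring here is a deliberately crude stand-in for whatever relevance ranking the runtime actually uses:

```python
def filter_search_results(results, query_terms, top_k=5):
    """Trim a tool's raw output to the top_k most relevant entries
    before it enters the model's context. Relevance is a naive
    term-overlap count, purely for illustration."""
    def score(result):
        return sum(term in result.lower() for term in query_terms)
    ranked = sorted(results, key=score, reverse=True)
    return ranked[:top_k]

# 50 raw results from a hypothetical file-search tool
results = [f"file_{i}.py: unrelated helper" for i in range(48)]
results += ["auth/login.py: def login(user)", "auth/session.py: session token"]

kept = filter_search_results(results, ["auth", "login"], top_k=5)
print(len(kept))  # 5 of 50 results survive into the context
```

The point is where the decision lives: the runtime trims before the model ever sees the output, so the agent never pays tokens for the other 45 results.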
Tiered context hierarchies: Drawing from memory systems research, agent context can be organized into tiers — hot context (current task state), warm context (recent results, summarized), and cold context (retrievable from external storage when needed).
┌──────────────────────────────────────┐
│ HOT CONTEXT │
│ Current task state + active tools │
│ (~2-4K tokens, always present) │
├──────────────────────────────────────┤
│ WARM CONTEXT │
│ Summarized recent steps + results │
│ (~1-2K tokens, compressed) │
├──────────────────────────────────────┤
│ COLD CONTEXT │
│ External memory / vector store │
│ (Retrieved on-demand, 0 baseline) │
└──────────────────────────────────────┘
Budget: Fixed per-step token ceiling enforced by the agent runtime

This isn’t just an optimization — it’s a different mental model. Instead of asking “how much context can I fit?”, cost-aware architectures ask “what’s the minimum context needed for this step to succeed?”
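A minimal sketch of this hierarchy, with word counts standing in for tokens and an invented `TieredContext` class (not a real framework API), might look like:

```python
class TieredContext:
    """Hot/warm/cold context hierarchy with a fixed token budget.
    Word counts approximate tokens; the demotion policy is illustrative."""

    def __init__(self, budget=50):
        self.budget = budget
        self.hot = []   # (raw_step, summary) pairs: full recent state
        self.warm = []  # summaries of demoted steps, kept in the window
        self.cold = []  # raw text spilled to external storage

    @staticmethod
    def _tokens(texts):
        return sum(len(t.split()) for t in texts)

    def in_window(self):
        """Everything that would actually be sent to the model."""
        return [raw for raw, _ in self.hot] + self.warm

    def add_step(self, raw, summary):
        self.hot.append((raw, summary))
        # Enforce the budget: demote the oldest raw step to its summary
        # (hot -> warm) and spill the raw text to cold storage.
        while self._tokens(self.in_window()) > self.budget and len(self.hot) > 1:
            old_raw, old_summary = self.hot.pop(0)
            self.warm.append(old_summary)
            self.cold.append(old_raw)

ctx = TieredContext(budget=50)
for i in range(3):
    ctx.add_step("word " * 30, f"summary of step {i}")
print(len(ctx.hot), len(ctx.warm), len(ctx.cold))  # 1 2 2
```

After three 30-token steps against a 50-token budget, only the newest raw step stays hot; the older two survive as summaries in the window while their raw text waits in cold storage for on-demand retrieval.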
Speculative Execution: The Frontier Pattern
The most architecturally interesting pattern emerging from cost pressure is speculative execution: using a cheaper model to attempt a solution first, then escalating to a more expensive model only when the cheap attempt fails quality checks.
This mirrors speculative execution in CPU design, where the processor guesses the likely branch path and only pays the full cost if the guess was wrong. In agent systems, the pattern looks like:
1. A small, fast model attempts the task
2. A classifier evaluates whether the output meets quality thresholds
3. Only if the check fails does the system escalate to a larger model
For tasks where the small model succeeds 70–80% of the time, speculative execution cuts average cost dramatically while maintaining quality on the harder cases. The math is compelling: if 75% of tasks resolve at 1/20th the cost, and 25% escalate to full price, the blended cost drops to roughly 30% of the always-use-the-big-model approach.
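The blended-cost arithmetic is worth making explicit. Assuming the small model runs at 1/20th the big model’s cost and succeeds 75% of the time:

```python
def blended_cost(p_small_success, small_cost, big_cost, check_cost=0.0):
    """Expected cost per task under speculative execution.
    Failed small attempts still pay the small cost before escalating."""
    escalate_rate = 1 - p_small_success
    return small_cost + check_cost + escalate_rate * big_cost

big = 1.0           # normalize the big model's cost to 1.0
small = big / 20    # small model at 1/20th the cost
print(blended_cost(0.75, small, big))  # ~0.30 vs 1.0 for always-big
```

Note that the small attempt is paid on every task, success or not, which is why the blended figure is 0.30 rather than a clean 0.25 + 0.05 split per branch; a non-trivial classifier cost would push it higher still.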
This pattern is already visible in production systems. Code editors use fast models for autocomplete and reserve larger models for complex edits. Agent runtimes route between model tiers based on task complexity estimates. The pattern is converging toward explicit, framework-level support rather than ad-hoc implementation.
The critical engineering challenge is building reliable quality classifiers — the “branch predictor” of agent execution. A bad classifier either wastes money on unnecessary escalation (too conservative) or ships low-quality results (too aggressive). Early approaches combine heuristic checks (output format, length, tool call validity) with lightweight model-based evaluation. As these classifiers improve, speculative execution will become a standard primitive in agent runtimes.
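A heuristic classifier of this kind can start as a handful of cheap checks. The thresholds below are placeholders, and a production version would layer a lightweight model-based evaluation on top:

```python
import json

def passes_quality_checks(output: str, min_len=20, require_json=False):
    """Cheap heuristic 'branch predictor': format, length, and validity
    checks before accepting a small model's output. Thresholds are
    illustrative placeholders."""
    if len(output.strip()) < min_len:
        return False          # suspiciously short output -> escalate
    if require_json:
        try:
            json.loads(output)
        except json.JSONDecodeError:
            return False      # malformed structured output -> escalate
    return True

print(passes_quality_checks('{"tool": "search", "query": "auth"}',
                            require_json=True))   # True: accept cheap result
print(passes_quality_checks("err"))               # False: escalate
```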
What This Means for Practitioners
If you’re building agent systems today, cost constraints are coming whether or not you plan for them. Here are a few concrete steps to prepare.
Instrument before you optimize. You can’t manage what you can’t measure. Add token-level cost tracking to every agent step — not just total cost per invocation, but cost per step, per tool call, per retry. Most agent observability platforms support this now, but few teams actually use it. Start collecting the data even if you don’t act on it immediately.
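As a sketch of what token-level tracking means in practice, a minimal per-step tracker might look like the following. The per-tier prices are placeholders, not real rates:

```python
from collections import defaultdict

class CostTracker:
    """Per-step token and cost accounting for a single agent run.
    Prices per 1K tokens are illustrative placeholders."""
    PRICE_PER_1K = {"small": 0.00025, "mid": 0.003, "large": 0.015}

    def __init__(self):
        self.steps = []

    def record(self, step_type, model_tier, input_tokens, output_tokens):
        tokens = input_tokens + output_tokens
        cost = tokens / 1000 * self.PRICE_PER_1K[model_tier]
        self.steps.append({"step": step_type, "tier": model_tier,
                           "tokens": tokens, "cost": cost})

    def by_step_type(self):
        """Aggregate cost per step type, to find where the money goes."""
        totals = defaultdict(float)
        for s in self.steps:
            totals[s["step"]] += s["cost"]
        return dict(totals)

tracker = CostTracker()
tracker.record("parse", "small", 2000, 200)
tracker.record("generate", "large", 30000, 4000)
tracker.record("retry:generate", "large", 34000, 4000)
print(tracker.by_step_type())
```

Even this toy version makes the skew visible: the retry of the generation step costs more than the original attempt, because it re-processes a larger context, while the parse step barely registers.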
Design for model substitutability. Don’t hard-code model choices deep in your agent logic. Build your runtime so that any step can be routed to a different model without changing the agent’s behavior specification. This means standardizing on common message formats and avoiding model-specific features in your core agent loop.
Set context budgets early. Even if you’re not optimizing costs yet, establishing a per-step context budget forces better architectural decisions. An agent that works within an 8K-token step budget is inherently more disciplined — and more debuggable — than one that relies on 200K tokens of accumulated context.
Profile your cost distribution. Most teams discover that 80% of their token spend comes from 20% of their agent steps — usually the ones involving large tool outputs or deep reasoning chains. Find those steps first.
| Strategy | Implementation Effort | Typical Cost Reduction | Quality Risk |
|---|---|---|---|
| Model routing | Medium | 60–80% | Low (with good classifiers) |
| Context summarization | Medium | 30–50% | Medium (information loss) |
| Tool output filtering | Low | 20–40% | Low |
| Prompt caching | Low | 10–30% | None |
| Speculative execution | High | 40–70% | Medium (classifier dependent) |
The Architecture That Emerges
The agent architecture that dominates in 2027 won’t look like today’s single-model, unbounded-context systems. It will look more like an operating system’s resource manager: allocating model compute, context memory, and tool access based on explicit budgets and priorities.
This isn’t a retreat from capability — it’s the maturation that every computing paradigm goes through. Mainframes gave way to time-sharing. Monolithic applications gave way to microservices. Single-model agents will give way to budget-aware, multi-model runtimes that achieve the same outcomes at a fraction of the cost.
The teams that build for this future now — instrumenting costs, abstracting model selection, managing context as a finite resource — will have a meaningful advantage when the economics of agent deployment become impossible to ignore. The teams that don’t will find themselves rewriting their agent architectures under production pressure, which is never where you want to be.