
Designing Agent Action Spaces: How to Build the Right Tools for Your AI Agent

A practitioner's guide to designing, scaling, and evolving the tool sets that define what AI agents can do — drawing on production lessons from Claude Code, research on tool scaling limits, and emerging patterns like Tool RAG and progressive disclosure.


March 1, 2026

One of the hardest problems in agent engineering is not reasoning, retrieval, or even planning. It is deciding what an agent can do. The set of tools you hand to a model — its action space — determines everything from task success rates to token costs to how often the agent gets stuck in dead ends. Get it right and the agent feels like a capable collaborator. Get it wrong and it thrashes, picks the wrong tool, or drowns in options it never needed.

This is a design problem, not a scaling problem. And as production agent systems have matured through 2025 and into 2026, practitioners have converged on a set of principles that were hard-won through iteration, failure, and careful observation of model behavior.

The Core Insight

Design your tools to match the model’s abilities, not the problem’s complexity. The right action space is not the one that covers every possible operation — it is the one the model can actually use well.

The Tool Scaling Problem

A naive approach to agent design adds a tool for every capability the agent might need. Twenty tools becomes fifty. Fifty becomes two hundred. This feels productive — more tools, more capability — but the data tells a different story.

Research from the RAG-MCP project measured tool selection accuracy as the number of available tools increased from 1 to 11,100. Without intervention, tool selection accuracy collapsed to 13.62% on benchmark tasks. The model could not reliably pick the right tool out of a large set, regardless of how well each tool was described.

This is not a quirk of a single model. The Berkeley Function Calling Leaderboard (BFCL) and other benchmarks confirm the pattern: as tool count grows, selection accuracy degrades, latency increases, and token costs balloon because every tool definition is sent with every request.

Tool Count vs. Agent Performance
Performance
    ▲
100%│ ●●●
    │     ●●
    │         ●●
    │             ●●
    │                 ●●●
    │                      ●●●●●
    │                              ●●●●●●
    │                                      ●●●●●●●●●●
    └──────────────────────────────────────────────── Tools
      0    5    10   20   50   100  500  1000  10000

◄─ Sweet Spot ─►◄── Diminishing Returns ──►◄── Degradation ──►

The math is straightforward. If you have 50 tools averaging 3,000 tokens each, you are spending 150,000 tokens per request just on tool definitions — before the user’s message, conversation history, or any retrieved context. That cost repeats on every turn of the agent loop.
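That arithmetic can be made concrete. A minimal sketch, using the illustrative figures from above rather than measurements:

```python
def tool_definition_overhead(num_tools: int, avg_tokens_per_tool: int,
                             turns: int = 1) -> int:
    """Tokens spent on tool definitions alone across an agent loop.

    Definitions are resent on every turn, so the cost scales linearly
    with both tool count and loop length.
    """
    return num_tools * avg_tokens_per_tool * turns

per_request = tool_definition_overhead(50, 3000)            # 150,000 tokens
per_ten_turn_loop = tool_definition_overhead(50, 3000, 10)  # 1,500,000 tokens
```

The multiplication by `turns` is the part that surprises people: a ten-turn loop pays the definition cost ten times.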

Principle 1: Fewer, Sharper Tools

The first lesson from production systems is counterintuitive: fewer tools outperform more tools. Anthropic’s engineering team found that Claude Code works well with roughly 20 tools, and the bar to add a new one is deliberately high. Each additional tool gives the model one more option to consider, increasing the likelihood of misselection.

This does not mean limiting capability. It means consolidating related operations into well-designed tools rather than exposing thin wrappers around every API endpoint.

Tool consolidation: thin wrappers vs. unified tools
Thin Wrappers (Avoid)                          Unified Tool (Prefer)
list_users, get_user, search_users             find_users(query, filters)
list_events, create_event, update_event        schedule_event(action, params)
read_file, read_file_lines, read_file_range    read(path, offset?, limit?)

OpenAI’s agent design guidance echoes this: keep each tool focused on a single read or write action, but design that action to handle the full workflow rather than fragmenting it across multiple tools. The tool should match how a human would describe the task, not how the underlying API is structured.
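As a concrete sketch, here is what the consolidated find_users tool from the table above might look like as a function-calling definition. The schema shape follows the common JSON Schema convention for tool inputs; the filter fields and defaults are illustrative assumptions, not any provider's actual API.

```python
# One consolidated tool instead of list_users / get_user / search_users.
# Field names and filters are illustrative.
find_users_tool = {
    "name": "find_users",
    "description": (
        "Find users by free-text query and optional filters. "
        "Covers listing, lookup by id or email, and fuzzy search — "
        "use this instead of separate list/get/search calls."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Free-text search: name, email, or user id.",
            },
            "filters": {
                "type": "object",
                "description": "Optional narrowing filters.",
                "properties": {
                    "team": {"type": "string"},
                    "active_only": {"type": "boolean", "default": True},
                },
            },
            "limit": {"type": "integer", "default": 10},
        },
        "required": ["query"],
    },
}
```

Note how the description tells the model when to reach for the tool, not just what it does.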

Principle 2: Design for the Model, Not the API

Thariq Shihipar from Anthropic describes a useful thought experiment: imagine being given a difficult math problem, and ask what tools you would want. A piece of paper is the minimum, but you are limited by manual calculation. A calculator is better, but you need to know how to operate it. A computer is the most powerful option, but you need to know how to code.

This is the right framework. You want tools shaped to the model’s abilities — and those abilities are specific and observable. Models are excellent at generating structured JSON, following schemas, and reasoning about which tool to call. They are poor at maintaining state across many parallel tool calls, handling ambiguous tool boundaries, and recovering from opaque error messages.

Practical implications:

  • Tool descriptions are prompts. Anthropic’s internal testing showed that prompt-engineering tool descriptions was one of the most effective levers for improving agent performance. Describe tools as you would explain them to a new team member — make implicit context explicit.
  • Error messages are steering mechanisms. When a tool call fails, the error response shapes the model’s next action. Return specific, actionable guidance (“No user found with email X. Try searching by name instead.”) rather than error codes or stack traces.
  • Return semantically rich data. Replace cryptic identifiers with human-readable names. Include only high-signal fields. Use a response_format parameter to let the model choose between detailed and concise output.
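A minimal sketch of the error-message principle. All names here are illustrative assumptions (the find_user function, the in-memory directory, the suggested fallback); the point is the shape of the failure message, not the lookup itself.

```python
def find_user(email: str, directory: dict) -> dict:
    """Look up a user, returning errors that steer the model's next step."""
    user = directory.get(email)
    if user is not None:
        return {"ok": True, "user": user}
    # Actionable guidance instead of an opaque code or stack trace:
    return {
        "ok": False,
        "error": (
            f"No user found with email {email!r}. "
            "Try find_users with a name query instead, or check the "
            "domain — internal addresses end in @example.com."
        ),
    }

directory = {"ada@example.com": {"name": "Ada", "team": "infra"}}
result = find_user("ada@exmaple.com", directory)  # note the typo'd domain
```

The error names a concrete alternative tool and a likely cause, which is exactly the information the model needs to self-correct on the next turn.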

Let Claude Write Your Tools

Anthropic’s “Writing Effective Tools for AI Agents” blog documents a powerful feedback loop: build an evaluation suite, run your agent against it, then paste the transcripts into Claude and let it suggest tool improvements. Internal testing showed Claude-optimized tools significantly outperformed human-written implementations on held-out test sets.

Principle 3: Progressive Disclosure

Not every capability needs to be a tool. Progressive disclosure is the pattern of letting agents discover context through exploration rather than loading everything upfront.

When Claude Code first launched, it used a RAG vector database to provide context. This required indexing and setup, and it was fragile across environments. More importantly, Claude was given context rather than finding it. As models improved, Anthropic replaced this with search tools (Grep, Glob) that let Claude build its own context.

The Agent Skills pattern takes progressive disclosure further. Instead of registering 50 tool definitions, you give the agent access to a skills directory. Each skill file contains instructions and references to other files. The agent reads what it needs on demand — exactly like a developer reading documentation.

Progressive Disclosure Layers
Layer 0: System Prompt
│  Always loaded. Minimal tool set (~20 tools)
│  Cost: fixed per request
│
▼
Layer 1: Skill Metadata
│  Short descriptions of available skills (~50 tokens each)
│  Loaded on demand when the agent scans the skills directory
│
▼
Layer 2: Skill Instructions
│  Full instructions for the selected skill (~1000 tokens)
│  Loaded only when the agent selects a specific skill
│
▼
Layer 3: Referenced Files
│  Examples, schemas, documentation
│  Loaded only when the skill instructions reference them
│
▼
Layer 4: Subagent Delegation
   Spawn a specialized agent with its own focused tool set
   Context is isolated — parent never sees the details
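Layers 1 and 2 can be sketched in a few lines. This assumes a hypothetical convention where each skill is a markdown file whose first line is its one-sentence summary; real skill formats differ.

```python
import tempfile
from pathlib import Path

def list_skill_metadata(skills_dir: Path) -> list[dict]:
    """Layer 1: read only the first line of each skill file as a short
    summary, instead of loading full instructions for every skill."""
    return [
        {"name": p.stem, "summary": p.read_text().splitlines()[0]}
        for p in sorted(skills_dir.glob("*.md"))
    ]

def load_skill(skills_dir: Path, name: str) -> str:
    """Layer 2: load the full instructions only for the selected skill."""
    return (skills_dir / f"{name}.md").read_text()

# Demo with a throwaway skills directory:
with tempfile.TemporaryDirectory() as d:
    skills = Path(d)
    (skills / "deploy.md").write_text(
        "Deploy a service to staging.\n\n1. Build the image...\n"
    )
    (skills / "triage.md").write_text(
        "Triage an incoming bug report.\n\n1. Reproduce locally...\n"
    )
    metadata = list_skill_metadata(skills)       # cheap scan of all skills
    instructions = load_skill(skills, "deploy")  # full text, on demand
```

The metadata scan costs tens of tokens per skill; the full instructions are only paid for once the agent commits to one.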

The Claude Code Guide subagent is a concrete example. Rather than stuffing documentation about Claude Code’s own features into the system prompt (which would add context rot and distract from the primary task of writing code), the team built a specialized subagent that Claude invokes when users ask questions about itself. The subagent has extensive instructions for searching docs and returns concise answers. This added capability to the action space without adding a tool.

The token savings are dramatic:

Traditional:  50 tools × 3,000 tokens            = 150,000 tokens/request
Skills:       20 lean tools (~1,000 tokens each)
              + skill metadata + 1 loaded skill  ≈  23,000 tokens/request
Savings:      ~85%

Principle 4: Tools Evolve with Models

What works for one model generation may not work for the next. The Anthropic team learned this directly when they replaced the TodoWrite tool with the Task tool.

TodoWrite was designed to keep Claude on track by maintaining a checklist. It worked, but the team also had to inject system reminders every 5 turns to prevent Claude from forgetting its todos. As models improved, this scaffolding became counterproductive. The reminders made Claude think it had to stick rigidly to the list instead of adapting. When Opus 4.5 gained better subagent coordination abilities, the single-agent todo list became a bottleneck.

The replacement — the Task tool — shifted the design goal from “keep the model on track” to “help agents communicate with each other.” Tasks support dependencies, cross-agent updates, and deletion. The tool evolved because the model’s capabilities evolved.

A similar evolution happened with elicitation. The AskUserQuestion tool went through three iterations:

  1. Attempt 1: Bolting onto ExitPlanTool. Adding a questions array to the plan tool confused Claude — it could not simultaneously present a plan and ask questions about it.
  2. Attempt 2: Modified output format. Asking Claude to output structured markdown for questions was unreliable. Claude would append extra sentences, omit options, or deviate from the format.
  3. Attempt 3: Dedicated tool. A standalone tool with structured input (question text, multiple-choice options) worked. Claude understood how to call it, produced consistent outputs, and users could answer quickly.
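A dedicated tool of this kind might declare its input roughly like this. The field names are illustrative guesses, not Claude Code's actual AskUserQuestion schema.

```python
# Sketch of a standalone elicitation tool: structured question text plus
# multiple-choice options, so the model has one unambiguous way to ask.
ask_user_question_tool = {
    "name": "AskUserQuestion",
    "description": (
        "Ask the user a clarifying question. Provide concrete "
        "multiple-choice options so the user can answer quickly."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "question": {"type": "string"},
            "options": {
                "type": "array",
                "items": {"type": "string"},
                "minItems": 2,
                "description": "Answers the user can pick from.",
            },
            "allow_free_text": {"type": "boolean", "default": True},
        },
        "required": ["question", "options"],
    },
}
```

Compared with attempt 2, nothing about the output format is left to the model's discretion: the schema itself enforces the structure.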

The lesson: even the best-designed tool does not work if the model does not understand how to call it. Tool design is empirical — you have to observe the model’s behavior and iterate.

Principle 5: Tool RAG for Large Registries

For systems that genuinely need hundreds of tools — enterprise integrations, MCP server aggregators, multi-domain agents — the emerging solution is Tool RAG. This applies the same retrieval-augmented generation pattern used for knowledge to tool definitions themselves.

Instead of sending all tool definitions with every request, Tool RAG embeds tool descriptions in a vector database and retrieves only the most relevant ones for each query. The RAG-MCP framework demonstrated this at scale:

RAG-MCP benchmark results: tool selection with and without retrieval
Approach                          Tool Selection Accuracy   Prompt Tokens
All tools in context (baseline)   13.62%                    100%
RAG-MCP (retrieval-based)         43.13%                    ~50%

Red Hat’s ToolScope library (released February 2026) provides an open-source implementation. The architecture is straightforward: embed tool descriptions at registration time, perform semantic search against the user query at runtime, and include only the top-k tool definitions in the prompt.
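The retrieval step can be sketched with a toy similarity function. A real implementation would use a dense embedding model and a vector store; here a bag-of-words cosine stands in so the example stays self-contained, and the registered tools are the illustrative ones from earlier.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'. Real Tool RAG uses a dense
    embedding model; the surrounding retrieval logic is the same."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

TOOL_REGISTRY = {  # descriptions are embedded once, at registration time
    "find_users": "search the user directory by name or email",
    "schedule_event": "create update or cancel a calendar event",
    "read": "read a file from disk by path",
}
INDEX = {name: embed(desc) for name, desc in TOOL_REGISTRY.items()}

def retrieve_tools(query: str, k: int = 2) -> list[str]:
    """At runtime, include only the top-k most relevant tool definitions."""
    q = embed(query)
    ranked = sorted(INDEX, key=lambda n: cosine(q, INDEX[n]), reverse=True)
    return ranked[:k]

retrieve_tools("put a meeting on my calendar for tuesday", k=1)
```

Only the retrieved definitions go into the prompt; the other hundreds (or thousands) cost nothing on that turn.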

The MCP ecosystem makes this particularly relevant. MCP servers grew from zero to more than 17,000 implementations in just over a year after Anthropic’s November 2024 release. No agent can load all available MCP tools simultaneously — retrieval-based selection is becoming mandatory at this scale.

A Decision Framework

When designing your agent’s action space, work through these questions in order:

Action Space Design Decision Tree
START: Does the agent need this capability?
│
├─ No → Don't add it
│
└─ Yes → Can it be achieved with existing tools?
     │
     ├─ Yes → Write better tool descriptions or examples
     │
     └─ No → Can it be handled via progressive disclosure?
          │
          ├─ Yes → Add a skill file, not a tool
          │
          └─ No → Is it needed on every request?
               │
               ├─ No → Use Tool RAG or conditional loading
               │
               └─ Yes → Add a tool, but consolidate with related tools
                         │
                         Test: Does accuracy degrade with the new tool?
                         │
                         ├─ No → Ship it
                         └─ Yes → Rethink the design

Practical Checklist

For each tool in your agent’s action space, verify:

  1. Necessity. Remove any tool that has not been called in evaluation runs. Unused tools are not free — they consume tokens and attention.
  2. Clarity. Can you explain what the tool does in one sentence? If not, it is too complex or too vague.
  3. Atomicity. Does the tool do one thing well? Complex tools with many modes confuse models. But the unit of atomicity should be a workflow step, not an API call.
  4. Error quality. Do failed calls return messages that help the model self-correct? Test this explicitly.
  5. Output efficiency. Does the tool return only what the model needs? Large JSON payloads waste context. Implement pagination, filtering, and concise response modes.
  6. Evaluation coverage. Do your evals test tool selection, not just tool execution? A tool that works perfectly but is never selected is useless.
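Item 5 on the checklist is straightforward to implement as a response_format parameter on the tool itself. A minimal sketch; the ticket fields and values are made up for illustration.

```python
def get_ticket(ticket_id: str, response_format: str = "concise") -> dict:
    """Return only what the model needs; full payloads on request.

    The response_format parameter lets the model trade detail for tokens.
    """
    record = {
        "id": ticket_id,
        "title": "Login page 500s on Safari",
        "status": "open",
        "assignee": "Ada Lovelace",          # human-readable, not a uuid
        "comments": ["...many comments..."],   # large, rarely needed
        "audit_log": ["...many entries..."],   # large, rarely needed
    }
    if response_format == "concise":
        return {k: record[k] for k in ("id", "title", "status", "assignee")}
    return record

concise = get_ticket("TICK-42")                             # small by default
full = get_ticket("TICK-42", response_format="detailed")    # opt in to bulk
```

Defaulting to the concise mode means the model pays for the large fields only when it explicitly asks for them.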

The Art of Seeing Like an Agent

Designing tools is as much art as science. It depends on the model you use, the goal of the agent, and the environment it operates in. The only reliable method is to experiment, read your agent’s outputs, and iterate. Pay attention to where the model hesitates, which tools it avoids, and where it makes incorrect selections. That observational data is more valuable than any design principle.


Tags: tool-use, agent-design, action-space, progressive-disclosure, tool-rag, function-calling

This article is an AI-generated summary. Read the original: Lessons from Building Claude Code: Seeing like an Agent — Thariq Shihipar.