Agent Selection as a Recommendation Problem: Benchmarking Query-to-Agent Routing
How to frame agent selection as a structured recommendation problem, and what a rigorous benchmark for that task looks like.
When you have dozens or hundreds of specialized agents available, the question “which agent should handle this request?” stops being trivial. Agent selection — routing an incoming query to the right agent or combination of agents — is one of the least-studied engineering problems in multi-agent systems, yet it directly determines whether a deployed system is reliable or chaotic. Treating it as a principled recommendation problem, rather than hand-rolled if/else logic, opens up a much richer design space.
Why Agent Selection Is Harder Than It Looks
Most teams start with simple routing: keyword matching, embedding similarity against agent descriptions, or a single dispatcher LLM that reads a system prompt listing available agents. These approaches work in demos but degrade quickly in production. The fundamental difficulty is that the mapping from a natural-language query to the right agent is many-to-many and context-dependent.
Consider a query like “Summarize the latest earnings call and flag any regulatory risks.” This might be best handled by a single compositional agent that chains document retrieval with a compliance classifier, or it might be split across a summarization agent and a separate risk-analysis agent, or it might be passed to a general-purpose LLM with retrieval tools attached. The correct answer depends on what agents are actually deployed, their capabilities, their costs, and their latency profiles — none of which are static.
Beyond ambiguity, there is a coverage problem. As the agent catalog grows, no single human can maintain an accurate mental model of which agent handles which queries well. The selection layer needs to learn this mapping from interaction data, which means it needs a structured way to collect, store, and evaluate that data.
Framing Selection as Recommendation
Recommendation systems have solved structurally similar problems for decades: given a user and a context, rank a large catalog of items by predicted relevance. Agent selection maps cleanly onto this frame — the “user” is the incoming query, the “items” are available agents, and relevance is task success.
This framing unlocks a large toolkit. Collaborative filtering can identify that queries similar to past queries tended to succeed with agent A rather than agent B. Content-based filtering can match query semantics against agent capability descriptions. Hybrid approaches can combine both signals with explicit features like agent availability or cost.
The recommendation framing also makes evaluation tractable. Standard retrieval metrics — precision@k, recall@k, normalized discounted cumulative gain (NDCG) — apply directly. You can measure whether the correct agent appeared in the top-1, top-3, or top-5 recommendations, and you can weight hits by their position in the ranked list.
Treating agent selection as recommendation lets you import decades of IR and RecSys evaluation methodology. NDCG is particularly useful because it penalizes putting the right agent at rank 3 instead of rank 1 — which matters when latency or cost favors calling only the top-ranked agent.
The Three Agent Archetypes
A useful taxonomy for any agent catalog distinguishes three types, each with different selection characteristics:
LLM-only agents rely entirely on a base or fine-tuned language model. Selection criteria are primarily about domain fit — does this model’s training distribution match the query domain? These are cheap to invoke and fail gracefully, so they are often good defaults.
Toolkit-only agents are thin wrappers around deterministic tools: a search API, a code executor, a database query layer. Selection is about capability matching — does this agent have access to the data source or action the query requires? Capability descriptions need to be precise enough for a selector to distinguish, say, a SQL agent from a vector-search agent.
Compositional agents chain multiple LLMs and tools into a pipeline. These are the most powerful but also the most expensive and brittle. A good selection layer should route to compositional agents only when simpler alternatives are unlikely to succeed.
                Incoming Query
                      │
                      ▼
             ┌─────────────────┐
             │ Selection Layer │
             │ (ranker/router) │
             └────────┬────────┘
                      │  ranked candidate list
         ┌────────────┼────────────┐
         │            │            │
         ▼            ▼            ▼
     LLM-only    Toolkit-only  Compositional
      agent         agent         agent
         │            │            │
         └─────┬──────┘            │
               │ ◄─────────────────┘
               ▼
          Task Result
               │
               ▼
        Interaction Log
    (query, agent, outcome)
               │
               ▼
     Selection Model Update

Building and Evaluating a Selection Layer
A production selection layer needs three components: a query encoder, an agent index, and a ranking model.
The query encoder produces a dense representation of the incoming request. A straightforward approach is embedding the raw query text, but richer encoders that extract intent, required capabilities, and expected output type tend to generalize better across agent types.
The agent index stores representations of each deployed agent. These can be embeddings of agent descriptions, but more reliable representations come from aggregating past query-outcome pairs — agents develop a usage signature that complements their declared capabilities.
# Minimal agent index entry
agent_record = {
    "id": "earnings-summarizer-v2",
    "description": "Retrieves and summarizes earnings call transcripts",
    "capability_tags": ["retrieval", "summarization", "finance"],
    "embedding": [...],        # from description
    "usage_embedding": [...],  # aggregated from past queries it handled
    "success_rate": 0.87,
    "avg_latency_ms": 4200,
    "cost_per_call_usd": 0.03,
}
The ranking model scores candidate agents against a query. At minimum this is cosine similarity between query and agent embeddings. More capable rankers use cross-encoders that attend jointly to the query and each agent record, or learning-to-rank models trained on interaction logs with binary or graded success labels.
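The cosine-similarity baseline can be sketched in a few lines. This is the floor, not the ceiling: `rank_agents` is an illustrative name, and in practice `query_emb` and each record's `"embedding"` would come from a real embedding model rather than the toy vectors shown here.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rank_agents(query_emb, agent_records, top_k=3):
    """Score every agent against the query embedding, highest first."""
    scored = [(cosine(query_emb, rec["embedding"]), rec["id"]) for rec in agent_records]
    scored.sort(reverse=True)
    return scored[:top_k]

# Toy 2-D embeddings for illustration only
records = [
    {"id": "sql-agent", "embedding": [1.0, 0.0]},
    {"id": "vector-search-agent", "embedding": [0.0, 1.0]},
]
rank_agents([0.9, 0.1], records, top_k=1)  # → [(score, "sql-agent")]
```

A cross-encoder or learning-to-rank model would replace `cosine` with a learned scoring function over the full agent record, trained on the interaction logs described next.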
When collecting interaction logs to train a ranker, record not just success/failure but why an agent failed — wrong capability, timeout, hallucination, missing data access. Structured failure codes let you train separate rerankers for different failure modes rather than a single noisy success signal.
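One lightweight way to enforce structured failure codes is a closed enum, so log entries can never contain free-text failure reasons. The enum members and log schema below are a sketch of this idea, not a standard.

```python
from enum import Enum

class FailureCode(Enum):
    """Closed vocabulary of failure reasons for interaction logs."""
    WRONG_CAPABILITY = "wrong_capability"
    TIMEOUT = "timeout"
    HALLUCINATION = "hallucination"
    MISSING_DATA_ACCESS = "missing_data_access"

# Example log entry (schema illustrative)
interaction_log_entry = {
    "query": "Summarize the latest earnings call and flag any regulatory risks.",
    "agent_id": "earnings-summarizer-v2",
    "success": False,
    "failure_code": FailureCode.TIMEOUT.value,
}
```

Filtering the log by `failure_code` is what makes it possible to train, say, a capability-mismatch reranker separately from a latency-aware one.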
Practical Engineering Considerations
Several issues arise when moving from a toy benchmark to a live selection layer.
Cold-start for new agents. A freshly deployed agent has no interaction history. Address this with content-based bootstrapping — use the agent’s description and capability tags alone until enough interactions accumulate to blend in collaborative signals. A minimum of 50–100 labeled interactions is a reasonable threshold before interaction embeddings become reliable.
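The content-to-collaborative handoff can be a simple linear blend whose weight grows with observed interactions. The function below is a sketch of that policy; the `ramp=100` threshold mirrors the 50–100 interaction rule of thumb above and would be tuned per deployment.

```python
def blended_embedding(description_emb, usage_emb, n_interactions, ramp=100):
    """Blend the static description embedding with the interaction-derived
    usage embedding, trusting usage data in proportion to how much exists."""
    w = min(n_interactions / ramp, 1.0)  # 0.0 when brand new, 1.0 after `ramp` interactions
    return [(1 - w) * d + w * u for d, u in zip(description_emb, usage_emb)]

blended_embedding([1.0, 0.0], [0.0, 1.0], n_interactions=0)   # pure description
blended_embedding([1.0, 0.0], [0.0, 1.0], n_interactions=50)  # 50/50 blend
```

A sharper variant would gate on label quality rather than raw count, since 100 ambiguous interactions are worth less than 50 cleanly labeled ones.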
Catalog drift. Agents are updated, deprecated, and replaced. The selection layer needs a re-indexing pipeline that fires whenever an agent’s capabilities or description change. Without this, the ranker will route queries to stale capability profiles.
Latency budget. The selection step should not dominate the overall request latency. Approximate nearest neighbor search (FAISS, ScaNN) over agent embeddings keeps candidate retrieval under a few milliseconds even for catalogs of hundreds of thousands of agents. Reserve the more expensive cross-encoder reranker for the top-k candidates (k = 10–20 is typical).
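The two-stage pattern looks like this in outline. For clarity the sketch uses a brute-force top-k scan where production would use FAISS or ScaNN, and `cross_scorer` stands in for a cross-encoder call; both function names are illustrative.

```python
import heapq

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve_candidates(query_emb, agent_index, k=20):
    """Stage 1: cheap top-k retrieval by inner product. In production this
    brute-force scan would be replaced by an ANN index (FAISS, ScaNN)."""
    return heapq.nlargest(k, agent_index, key=lambda rec: dot(query_emb, rec["embedding"]))

def rerank(query_emb, candidates, cross_scorer):
    """Stage 2: run the expensive scorer (e.g. a cross-encoder) only on the
    short candidate list, never on the full catalog."""
    return sorted(candidates, key=lambda rec: cross_scorer(query_emb, rec), reverse=True)
```

The key invariant is that the expensive model's cost scales with k, not with catalog size, which is what keeps the selection step inside its latency budget.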
Evaluation frequency. Unlike static benchmarks, a live selection layer should be evaluated continuously. Track NDCG@1 and NDCG@5 on a rolling window of logged interactions. Drops in these metrics are early signals of catalog drift or distribution shift in incoming queries before they surface as user-visible failures.
The agent selection problem is likely to grow more acute as agent catalogs expand. Treating it as a first-class engineering concern — with a dedicated data pipeline, a structured representation for agents, and rigorous offline and online evaluation — pays dividends in system reliability and observability that ad-hoc routing cannot provide.
This article is an AI-generated summary. Read the original paper: AgentSelect: Benchmark for Narrative Query-to-Agent Recommendation.