danielhuber.dev@proton.me Sunday, April 5, 2026

Agent Selection at Scale: Matching Queries to the Right Agent

How to treat agent selection as a recommendation problem, and what engineers need to know to build systems that route tasks to the right LLM agent automatically.


As agent ecosystems grow from a handful of hand-picked tools to catalogues of hundreds or thousands of deployable agents, the question “which agent should handle this request?” becomes a first-class engineering problem. Naively routing every query to a general-purpose agent wastes capability; routing blindly to a specialist agent breaks on queries that fall outside its domain. Treating agent selection as a structured recommendation problem—matching a natural-language query to the most capable agent—gives you a principled framework for solving it.

Why Agent Selection Is Hard

On the surface, routing a query to an agent looks like classification: read the query, pick a label. In practice it is much messier. The space of agents is not fixed—new agents are deployed, deprecated, or updated continuously. Each agent may be described only by a natural-language README, a list of tools, or prior interaction logs, none of which form a clean feature vector. Queries are narrative: a user asking “help me draft a contract amendment and check it against our compliance policy” is expressing a compound intent that could map to a legal-drafting agent, a document-comparison agent, or a composition of both.

Traditional intent classification assumes a closed, stable label set and crisp category boundaries. Agent selection violates both assumptions. The right mental model is collaborative filtering or dense retrieval: you are looking for semantic affinity between a query and an agent’s demonstrated competence, not a hard categorical match.

Note

Agent selection is structurally similar to product search or job matching: the query side is a user need expressed in prose, and the item side is a heterogeneous catalogue with mixed-quality metadata. Lessons from information retrieval apply directly.

The Three Agent Types You Need to Handle

Any agent selection system has to cope with at least three agent archetypes, and each presents different signals for matching:

LLM-only agents carry no explicit tool declarations. Their capability is implicit in their system prompt, fine-tuning history, or prior interaction records. Matching against them requires embedding the agent’s behavioral description and comparing it to the query embedding—or, better, using interaction logs as a proxy for demonstrated competence.

Toolkit-only agents expose a deterministic capability surface: a list of callable tools with schemas. Here you can be more precise. A query that mentions “send an email” should score highly against an agent that registers a send_email tool, and you can use structured tool-schema matching in addition to semantic similarity.

Compositional agents are orchestrators that delegate to sub-agents or tools. Matching a query to a compositional agent requires reasoning about its routing logic, not just its own capabilities. This is the hardest case and where naive retrieval breaks down most visibly.

User Query (narrative)


 ┌─────────────┐
 │  Query      │
 │  Encoder    │
 └──────┬──────┘
        │ dense vector

 ┌──────────────────────────────────┐
 │         Agent Index              │
 │  ┌───────────┐ ┌──────────────┐  │
 │  │ LLM-only  │ │Toolkit-only  │  │
 │  │  agents   │ │   agents     │  │
 │  │(embed sys │ │(embed tool   │  │
 │  │  prompt)  │ │  schemas)    │  │
 │  └───────────┘ └──────────────┘  │
 │  ┌───────────────────────────┐   │
 │  │  Compositional agents     │   │
 │  │  (embed routing graph)    │   │
 │  └───────────────────────────┘   │
 └──────────────┬───────────────────┘
                │ top-k candidates

       ┌────────────────┐
       │  Re-ranker /   │
       │  Selector LLM  │
       └────────┬───────┘


       Selected Agent(s)

Building an Agent Selection Pipeline

A practical pipeline has three stages: indexing, retrieval, and re-ranking.

Indexing is where most teams underinvest. For each agent, you want to store: (a) a dense embedding of the agent’s description and tool schemas, (b) structured metadata like supported domains, required permissions, and latency profile, and (c) a sample of past interactions summarized into competence signals. Keep this index fresh—agents change.
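As a concrete sketch, a single index record can carry all three signal types. The field names below are illustrative assumptions, not part of any particular indexing system:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical index record; field names are assumptions for illustration.
@dataclass
class AgentRecord:
    agent_id: str
    description: str
    embedding: List[float]  # (a) dense embedding of description + tool schemas
    domains: List[str] = field(default_factory=list)      # (b) structured metadata
    permissions: List[str] = field(default_factory=list)
    latency_ms_p50: float = 0.0
    competence_notes: List[str] = field(default_factory=list)  # (c) summarized past interactions

record = AgentRecord(
    agent_id="email-assistant-v2",
    description="Drafts and sends emails via the send_email tool.",
    embedding=[0.12, -0.43, 0.88],
    domains=["communication"],
    permissions=["email"],
    latency_ms_p50=420.0,
)
```

Keeping the structured metadata separate from the embedding makes hard filtering cheap at retrieval time.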

Retrieval produces a candidate set using approximate nearest-neighbor search over the dense embeddings. ANN search is fast enough that you can retrieve 20–50 candidates in milliseconds even over large catalogues. Hybrid retrieval—combining BM25 keyword matching with dense vectors—handles edge cases where the query uses exact tool names or domain jargon that embedding models may not weight heavily enough.
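One simple way to implement hybrid retrieval is to fuse the BM25 and dense rankings with reciprocal rank fusion. The sketch below assumes each retriever returns an ordered list of agent IDs; the agent names are made up:

```python
from collections import defaultdict
from typing import Dict, List

def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of agent IDs; items ranked high anywhere score well."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, agent_id in enumerate(ranking, start=1):
            scores[agent_id] += 1.0 / (k + rank)
    # Sort agent IDs by fused score, best first.
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["send-email-agent", "legal-drafter", "doc-compare"]
dense_hits = ["send-email-agent", "legal-drafter", "calendar-agent"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

An agent surfaced by both retrievers ranks above one surfaced by only one, which is exactly the behavior you want for jargon-heavy queries.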

Re-ranking applies a more expensive model to the candidate set. This can be a cross-encoder that jointly encodes the query and each agent description, or a small LLM that reasons explicitly about fit. Re-ranking is where you incorporate constraints: if the query requires a particular permission scope, filter hard before scoring.

from typing import List, Optional

def select_agent(
    query: str,
    agent_index: AgentIndex,
    reranker: Reranker,
    top_k: int = 5,
    hard_filters: Optional[dict] = None,
) -> List[Agent]:
    # Stage 1: dense retrieval over the agent index, with hard
    # constraints applied before any scoring.
    candidates = agent_index.search(
        query=query,
        k=50,
        filters=hard_filters,  # e.g., {"permissions": ["email"]}
    )

    # Stage 2: re-rank the candidate set with a cross-encoder or LLM.
    ranked = reranker.rank(
        query=query,
        candidates=candidates,
    )

    # Return the k best-fitting agents.
    return ranked[:top_k]

Tip

Don’t skip the hard-filter step before re-ranking. Letting an LLM re-ranker see agents that fundamentally cannot satisfy a permission or capability constraint wastes tokens and sometimes produces confidently wrong selections.

Evaluating Your Selection System

Selection quality is measured differently from downstream task completion. The core metrics are:

  • Hit@k: Does the correct agent appear in the top-k results? Use k=1, 3, and 5.
  • MRR (Mean Reciprocal Rank): Rewards returning the right agent higher in the list.
  • NDCG: Useful when there are multiple acceptable agents with different quality tiers.
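These metrics are straightforward to compute from logged selection results. A minimal sketch, assuming each result is an ordered list of agent IDs and each query has a single correct agent:

```python
from typing import List

def hit_at_k(ranked: List[str], correct: str, k: int) -> bool:
    """True if the correct agent appears in the top-k of the ranked list."""
    return correct in ranked[:k]

def mean_reciprocal_rank(results: List[List[str]], labels: List[str]) -> float:
    """Average of 1/rank of the correct agent; contributes 0 when absent."""
    total = 0.0
    for ranked, correct in zip(results, labels):
        if correct in ranked:
            total += 1.0 / (ranked.index(correct) + 1)
    return total / len(results)

# Two toy queries: correct agent at rank 2, then rank 1 -> MRR = 0.75.
queries_ranked = [
    ["legal-drafter", "doc-compare", "email-agent"],
    ["email-agent", "legal-drafter", "doc-compare"],
]
labels = ["doc-compare", "email-agent"]
```

When multiple agents are acceptable per query, switch from a single label to a graded relevance list and use NDCG instead.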

For evaluation data, you need labeled query–agent pairs. If you do not have historical logs, you can bootstrap a dataset by having an LLM generate synthetic queries for each agent based on its description, then manually audit a sample. As your system matures, production interaction outcomes (did the selected agent complete the task?) become the ground truth signal.

Practical Considerations for Production

Agent catalogues in production are rarely static. Design your index to support incremental updates without full re-indexing. Store agent embeddings with a version hash so you can detect when an agent’s description or tool schema has changed and re-embed only what is necessary.
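A minimal version-hash sketch: hash exactly the fields that feed the embedding, so any change to them flags the agent for re-embedding. The payload fields are assumptions about what your index stores:

```python
import hashlib
import json

def agent_version_hash(description: str, tool_schemas: list) -> str:
    """Stable hash over the fields that determine the agent's embedding."""
    payload = json.dumps(
        {"description": description, "tools": tool_schemas},
        sort_keys=True,  # key order must not change the hash
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

stored = agent_version_hash("Sends email.", [{"name": "send_email"}])
current = agent_version_hash("Sends and schedules email.", [{"name": "send_email"}])
needs_reembed = stored != current  # description changed, so re-embed this agent
```

Comparing the stored hash against a freshly computed one on each catalogue sync keeps re-embedding proportional to what actually changed.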

For latency-sensitive applications, precompute embeddings for common query templates or use cached retrieval results for repeated query patterns. The re-ranking step is the bottleneck; expose a fast_select path that skips re-ranking and returns raw ANN results when latency matters more than precision.
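The fast path can reuse the same index interface as select_agent above and simply skip the re-rank stage; AgentIndex and Agent here are the same assumed interfaces:

```python
from typing import List, Optional

def fast_select(
    query: str,
    agent_index: "AgentIndex",
    top_k: int = 5,
    hard_filters: Optional[dict] = None,
) -> "List[Agent]":
    # Raw ANN results, no cross-encoder pass: lower precision, much lower latency.
    return agent_index.search(query=query, k=top_k, filters=hard_filters)
```

Routing latency-critical traffic here while the default path keeps re-ranking gives you a precision/latency dial per call site.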

Finally, log every selection decision with the query, the candidate set, the selected agent, and the eventual outcome. This log is your most valuable asset for improving the system over time—both for offline evaluation and for fine-tuning a dedicated selection model as your dataset grows.

Tags: research, multi-agent, benchmarking, evaluation, agent-selection, routing

This article is an AI-generated summary. Read the original paper: AgentSelect: Benchmark for Narrative Query-to-Agent Recommendation.