danielhuber.dev@proton.me Sunday, April 5, 2026

SkillNet: Building a Shared Ontology for AI Agent Skills

How a unified skill ontology and open repository changes the way agents discover, evaluate, and compose capabilities at scale.


Most agent systems treat skills as private artifacts—prompt templates, tool wrappers, and few-shot examples locked inside a single codebase. That works at small scale, but falls apart when you need dozens of agents to share capabilities, reuse proven implementations, or benchmark skill quality in a consistent way. A shared skill ontology and open repository changes this equation by turning skills into first-class, addressable, composable infrastructure.

Why Ad-Hoc Skill Management Breaks Down

When a team builds a coding agent, a research agent, and a customer-support agent independently, each one typically reinvents the same capabilities: web search, document summarization, code execution, entity extraction. There is no canonical definition of what a “web search” skill is, no shared benchmark for whether one implementation outperforms another, and no mechanism for an agent to discover that a better version already exists somewhere in the organization.

This leads to three recurring problems. First, duplication: teams spend engineering cycles rebuilding functionally identical skills. Second, inconsistency: two agents that both claim to do summarization may behave completely differently, making it hard to reason about system-level behavior. Third, opacity: when a skill underperforms in production, there is no structured audit trail connecting that skill to its definition, its evaluation history, or alternative implementations.

The deeper issue is that skills are treated as code artifacts when they should also be treated as knowledge artifacts with identities, relationships, and measurable quality attributes.

The Ontology Layer: Giving Skills a Common Language

A skill ontology provides a structured vocabulary for describing what a skill does, what inputs it consumes, what outputs it produces, and how it relates to other skills. Think of it as a type system for agent capabilities.

At minimum, a useful ontology captures:

  • Skill identity: a stable, namespaced identifier (e.g., nlp.summarization.extractive) so agents can refer to the same capability unambiguously
  • Input/output schemas: machine-readable contracts that let orchestrators wire skills together without manual glue code
  • Hierarchical relationships: a skill like code.python.debug is a specialization of code.debug, which is a specialization of code—traversing this hierarchy lets an agent fall back gracefully when a specific skill is unavailable
  • Dependency edges: some skills compose others (a research skill might depend on a search skill and a summarization skill), and encoding these dependencies allows automatic capability graphs to be constructed
Note

An ontology is not a registry. A registry is a flat list of things you can call. An ontology encodes meaning—what skills are, how they relate, and what substitutions are valid. You need the ontology layer before a registry becomes useful at scale.

With an ontology in place, an orchestrator can ask questions like: “Does this agent have any skill that is a subtype of data.retrieval?” or “What skills does report.generation depend on, and are all of them available?” These are impossible queries against a flat tool list.
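One way to answer the subtype question is to walk parent edges in the ontology graph. A minimal sketch, assuming the hierarchy is stored as a `skill_id -> parent_id` mapping (the mapping and agent skill list here are illustrative):

```python
def is_subtype(ontology: dict, skill_id: str, capability_type: str) -> bool:
    """Walk parent edges upward until we hit the capability type or the root."""
    current = skill_id
    while current is not None:
        if current == capability_type:
            return True
        current = ontology.get(current)  # parent, or None at the root
    return False

# Illustrative hierarchy: child -> parent
ontology = {
    "code.python.debug": "code.debug",
    "code.debug": "code",
    "code": None,
    "data.retrieval.sql": "data.retrieval",
    "data.retrieval": "data",
    "data": None,
}

# "Does this agent have any skill that is a subtype of data.retrieval?"
agent_skills = ["code.python.debug", "data.retrieval.sql"]
matches = [s for s in agent_skills if is_subtype(ontology, s, "data.retrieval")]
# matches == ["data.retrieval.sql"]
```

The same traversal supports graceful fallback: if `code.python.debug` is unavailable, the orchestrator can climb to `code.debug` and look for any available subtype of it.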

Repository Architecture: Skills as Shared Infrastructure

Once skills have stable identities and schemas, they can be stored in a shared repository—a versioned, searchable catalog that multiple agents and teams can contribute to and consume from.

┌─────────────────────────────────────────────────────────┐
│                     Skill Repository                    │
│                                                         │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
│  │  Skill Card  │  │  Skill Card  │  │  Skill Card  │  │
│  │  id: nlp.*   │  │  id: code.*  │  │  id: data.*  │  │
│  │  schema      │  │  schema      │  │  schema      │  │
│  │  impl refs   │  │  impl refs   │  │  impl refs   │  │
│  │  eval scores │  │  eval scores │  │  eval scores │  │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘  │
│         │                 │                 │           │
│  ┌──────▼─────────────────▼─────────────────▼───────┐  │
│  │               Ontology Graph                      │  │
│  │  (hierarchy, dependencies, substitution rules)    │  │
│  └───────────────────────────────────────────────────┘  │
└────────────────────────┬────────────────────────────────┘

         ┌───────────────┼───────────────┐
         ▼               ▼               ▼
   Agent A          Agent B         Orchestrator
  (consumer)       (producer)       (discovery)

Each entry in the repository carries a skill card: a structured document containing the ontology identifier, input/output schemas, one or more implementation references (a function pointer, an API endpoint, a model checkpoint), and a set of evaluation scores. The separation between the definition of a skill and its implementations is critical—it lets the repository hold multiple competing implementations of the same skill and surface the best-performing one for a given context.
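A skill card might look like the following. The shape and field names are hypothetical, not taken from the paper; the point is that the card references implementations rather than containing one:

```python
# Hypothetical skill card: one definition, multiple competing implementations
skill_card = {
    "id": "nlp.summarization.extractive",
    "version": "1.3.0",
    "input_schema": {"document": "str", "max_sentences": "int"},
    "output_schema": {"summary": "str"},
    "implementations": [
        {"ref": "api://summarizer-v2/extract",
         "eval": {"task_score": 0.87, "latency_ms": 420}},
        {"ref": "fn://fast_extractive_sum",
         "eval": {"task_score": 0.79, "latency_ms": 95}},
    ],
}

# Surface the best-performing implementation for a quality-first context
best = max(skill_card["implementations"], key=lambda i: i["eval"]["task_score"])
```

Because evaluation scores live alongside each implementation reference, the selection rule can change per context (quality-first here, latency-first elsewhere) without touching the definition.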

Version control matters here for the same reason it matters for software packages. A skill that worked well last month may degrade after a model update. Teams need to pin versions, run regression checks, and roll back when quality drops.
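Pinning and regression checks can mirror package-manager conventions. A minimal sketch, where the pin format and tolerance are illustrative assumptions:

```python
# Hypothetical pin: an agent config freezes a skill at an exact version
pinned = {"nlp.summarization.extractive": "1.3.0"}

def is_regression(current_score: float, baseline_score: float,
                  tolerance: float = 0.02) -> bool:
    """Flag a regression if the new score drops more than `tolerance`
    below the score recorded for the pinned version."""
    return current_score < baseline_score - tolerance

# After a model update, re-run the skill's benchmark and compare
if is_regression(current_score=0.80, baseline_score=0.87):
    pass  # keep the pin / roll back rather than promote the new version
```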

Evaluating Skills in a Structured Way

Ad-hoc skill testing—“I ran it on a few examples and it seemed fine”—does not scale. A shared repository demands a shared evaluation standard so that skill cards can carry meaningful quality signals.

A practical evaluation framework for skills has three layers:

Unit-level evaluation tests a skill in isolation with a fixed benchmark. A summarization skill gets a set of documents with reference summaries and receives an automatic score (ROUGE, BERTScore, LLM-as-judge, or task-specific metrics). This produces a comparable number that can be attached to the skill card and updated whenever the implementation changes.

Integration-level evaluation tests whether the skill behaves correctly when wired into a larger agent workflow. A search skill that scores well in isolation may still cause failures downstream if it returns results in an unexpected format. Integration tests catch schema mismatches and behavioral drift that unit tests miss.

Task-level evaluation measures contribution to end-to-end agent performance. This is the most expensive layer but the most informative—it answers whether having this skill in the agent’s repertoire actually improves outcomes on real tasks.
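The unit-level layer can be sketched concretely: run the skill over a fixed benchmark of (input, reference) pairs and attach the mean score to its card. The toy word-overlap metric below stands in for ROUGE or an LLM judge, and the benchmark data and "skill" are illustrative:

```python
def overlap_score(candidate: str, reference: str) -> float:
    """Toy stand-in for ROUGE: fraction of reference words recalled."""
    cand, ref = set(candidate.lower().split()), set(reference.lower().split())
    return len(cand & ref) / len(ref) if ref else 0.0

def unit_eval(skill_fn, benchmark: list) -> float:
    """Run the skill over (document, reference) pairs; return the mean score."""
    scores = [overlap_score(skill_fn(doc), ref) for doc, ref in benchmark]
    return sum(scores) / len(scores)

# Illustrative benchmark and a deliberately naive "summarizer" to score
benchmark = [("the cat sat on the mat today", "cat sat on mat")]
truncate_skill = lambda doc: " ".join(doc.split()[:4])
score = unit_eval(truncate_skill, benchmark)
# score is a single comparable number suitable for a skill card
```

Whenever the implementation changes, rerunning `unit_eval` on the same benchmark yields a directly comparable before/after number.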

Tip

For skills with high reuse potential, invest in all three layers. For one-off skills built for a single workflow, unit-level coverage plus manual review is usually sufficient. Don’t apply the same evaluation overhead uniformly.

Scores from all three layers can be aggregated into a skill health score attached to the repository entry. Orchestrators can use this to make routing decisions: prefer the implementation with the highest task-level score when latency allows, fall back to a faster but lower-quality implementation under tight budgets.
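The aggregation and routing described above can be sketched as a weighted health score plus a latency-aware selection rule. The weights, field names, and implementation entries are illustrative assumptions:

```python
# Illustrative weights: task-level evidence counts most
WEIGHTS = {"unit": 0.2, "integration": 0.3, "task": 0.5}

def health_score(evals: dict) -> float:
    """Weighted aggregate of the three evaluation layers."""
    return sum(WEIGHTS[layer] * evals[layer] for layer in WEIGHTS)

def route(implementations: list, latency_budget_ms: float) -> dict:
    """Prefer the healthiest implementation that fits the latency budget;
    fall back to the overall pool if nothing fits."""
    affordable = [i for i in implementations if i["latency_ms"] <= latency_budget_ms]
    pool = affordable or implementations
    return max(pool, key=lambda i: health_score(i["evals"]))

impls = [
    {"ref": "api://slow-accurate", "latency_ms": 900,
     "evals": {"unit": 0.9, "integration": 0.9, "task": 0.9}},
    {"ref": "fn://fast-cheap", "latency_ms": 80,
     "evals": {"unit": 0.8, "integration": 0.7, "task": 0.6}},
]
choice = route(impls, latency_budget_ms=100)  # tight budget -> faster implementation
```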

Connecting Skills: Discovery and Composition at Runtime

A repository full of well-defined, evaluated skills is only valuable if agents can find and use them. This requires a discovery mechanism that goes beyond keyword search.

Ontology-aware discovery lets an agent query by capability type rather than by name:

# Ontology-aware skill discovery
def find_skills(orchestrator, capability_type: str, min_score: float = 0.8):
    """
    Find all skills that satisfy a capability type,
    filtered by minimum task-level evaluation score.
    """
    candidates = orchestrator.skill_repo.query(
        ontology_type=capability_type,      # e.g. "nlp.summarization"
        include_subtypes=True,              # also match more specific skills
        filters={"eval.task_score": {"gte": min_score}},
        sort_by="eval.task_score",
        order="desc"
    )
    return candidates

# Agent requests a summarization skill without knowing which one exists
skills = find_skills(orchestrator, "nlp.summarization", min_score=0.75)
best = skills[0] if skills else fallback_skill  # fallback_skill: a pre-configured default

Composition works through the dependency graph. When an orchestrator needs to run a report.generation skill and discovers that it depends on data.retrieval and nlp.summarization, it can automatically resolve and instantiate those dependencies from the repository—provided all required skills are available and their schemas are compatible.
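Dependency resolution is a graph traversal over those edges. A minimal sketch, assuming dependencies are stored as a `skill_id -> list of dependencies` mapping (the `nlp.tokenize` dependency is invented for illustration):

```python
def resolve(deps: dict, skill_id: str, resolved=None, seen=None) -> list:
    """Post-order traversal: each skill appears after everything it depends on."""
    resolved = [] if resolved is None else resolved
    seen = set() if seen is None else seen
    if skill_id in seen:
        return resolved  # already handled (a real resolver should detect cycles)
    seen.add(skill_id)
    for dep in deps.get(skill_id, []):
        resolve(deps, dep, resolved, seen)
    resolved.append(skill_id)
    return resolved

# Illustrative dependency edges, extending the article's example
deps = {
    "report.generation": ["data.retrieval", "nlp.summarization"],
    "nlp.summarization": ["nlp.tokenize"],
}

order = resolve(deps, "report.generation")
# every dependency is instantiated before the skill that needs it
```

Before instantiating, a real orchestrator would also check each resolved skill's availability and schema compatibility, per the caveat above.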

This composability is where the ontology investment pays off most visibly. An agent can gain new capabilities at runtime simply because a new skill was added to the shared repository, without any code changes to the agent itself. The skill graph becomes living infrastructure that grows as the organization’s AI capabilities grow.

Engineering Implications

Building toward a shared skill ontology is not an all-or-nothing investment. Teams can start small: define stable identifiers for the five or ten skills that already appear across multiple agents, attach simple evaluation benchmarks, and publish them to an internal registry. The ontology hierarchy can grow incrementally as patterns emerge.

The payoff compounds over time. Agents built later can bootstrap from proven skill implementations rather than starting from scratch. Evaluation infrastructure built for one skill generalizes to others in the same ontology subtree. When a model provider releases a better base model, you have a clear mechanism to re-evaluate every skill in the repository and identify which ones improved and which ones regressed.

The underlying insight is that skills are the unit of reuse in agent systems in the same way that libraries are the unit of reuse in software systems. Treating them as shared, versioned, evaluated infrastructure rather than private implementation details is what makes large-scale agent engineering manageable.

Tags: research, skills, multi-agent, tool-use, orchestration, evaluation, knowledge-graph

This article is an AI-generated summary. Read the original paper: SkillNet: Create, Evaluate, and Connect AI Skills.