Skills Pattern
A filesystem-based approach to tool management that cuts tool-definition overhead by roughly 98% by loading tool definitions on demand rather than sending every tool with every request.
Traditional function calling requires sending every tool definition with every API request. For a capable agent with 50 specialized tools, each definition averaging 3,000 tokens, that is 150,000 tokens of overhead per request — before any user message, context, or reasoning has been included. This scaling problem makes complex agents prohibitively expensive, and it gets worse as you add capabilities. The Skills Pattern solves this by treating tools as files on disk that are loaded on demand, rather than static definitions passed wholesale with every call.
The Problem: Context Bloat from Tools
The core insight is simple: an agent almost never needs all of its tools simultaneously. A user asking to “search the web for recent news about AI” needs the web-search skill. They do not need the code-review skill, the data-analysis skill, or the email-composer skill. Sending all of those definitions with every request wastes tokens, inflates latency, and degrades reasoning quality by crowding the context with irrelevant information.
The Skills Pattern solves this by giving the agent access to a skills/ directory. The agent reads skill files as needed — exactly like a developer reads documentation — rather than having everything preloaded into memory whether it is relevant or not.
50 tools × 3,000 tokens/tool = 150,000 tokens/request (traditional)
50 skills × 50 tokens/metadata + 1 skill × 1,000 tokens = 3,500 tokens (skills pattern)
Savings: 97.7%
Three Pillars of the Skills Pattern
┌──────────────────────────────────────────────────────────────┐
│                        Skills Pattern                        │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  1. FILESYSTEM AS TOOL STORAGE                               │
│     skills/                                                  │
│     ├── web-search/SKILL.md                                  │
│     ├── code-review/SKILL.md                                 │
│     └── data-analysis/SKILL.md                               │
│                                                              │
│  2. PROGRESSIVE DISCLOSURE                                   │
│     ┌──────────┐    ┌──────────────┐    ┌──────────────┐     │
│     │ Metadata │ →  │ Instructions │ →  │   Examples   │     │
│     │  ~50 tok │    │  ~1000 tok   │    │  ~2000 tok   │     │
│     └──────────┘    └──────────────┘    └──────────────┘     │
│          ↑                 ↑                    ↑            │
│        Always          On select           If complex        │
│                                                              │
│  3. DATABASE-BACKED DISCOVERY (Optional)                     │
│     ┌─────────────┐                                          │
│     │  Vector DB  │  ← Embed skill descriptions              │
│     │  (Chroma,   │  ← Semantic search for relevance         │
│     │   Qdrant)   │  ← Skip metadata scanning                │
│     └─────────────┘                                          │
│                                                              │
└──────────────────────────────────────────────────────────────┘
1. Filesystem as Tool Storage
Skills are organized as directories, each containing a SKILL.md file with YAML frontmatter for metadata and a markdown body for instructions. The agent can list, read, and navigate these files using standard filesystem tools — no special API required. This makes skills introspectable, version-controllable, and user-extensible without code changes.
The following shows how to scan the filesystem and load skill metadata:
from pathlib import Path
from dataclasses import dataclass
import yaml
@dataclass
class SkillMetadata:
name: str
description: str
triggers: list[str]
tools_required: list[str]
def discover_skills(skills_dir: Path) -> dict[str, SkillMetadata]:
"""Scan filesystem to discover available skills."""
skills = {}
for skill_path in skills_dir.iterdir():
if not skill_path.is_dir():
continue
skill_file = skill_path / "SKILL.md"
if not skill_file.exists():
continue
# Parse SKILL.md frontmatter
content = skill_file.read_text()
metadata = parse_skill_frontmatter(content)
skills[skill_path.name] = SkillMetadata(
name=metadata.get("name", skill_path.name),
description=metadata.get("description", ""),
triggers=metadata.get("triggers", []),
tools_required=metadata.get("tools", [])
)
return skills
def parse_skill_frontmatter(content: str) -> dict:
"""Extract YAML frontmatter from SKILL.md."""
if not content.startswith("---"):
return {}
end_idx = content.find("---", 3)
if end_idx == -1:
return {}
frontmatter = content[3:end_idx].strip()
    return yaml.safe_load(frontmatter) or {}  # safe_load returns None for empty frontmatter
Each SKILL.md file follows a standard format. The frontmatter captures everything needed for skill selection — name, description, trigger phrases, required tools, and a token estimate for prioritization. The markdown body provides the full instructions that get loaded only when the skill is selected.
The triggers field deserves particular attention. These are phrases that help the agent quickly match user requests to relevant skills without reading the full instructions. Specific, domain-relevant trigger phrases dramatically improve selection accuracy. Overly generic triggers like “help me” match everything and should be avoided.
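For concreteness, here is what such a file might look like. This is only a sketch: the skill name, description, triggers, and tool names are invented for illustration, but the frontmatter fields match what discover_skills and the indexing code below expect (name, description, triggers, tools, token_estimate).

from pathlib import Path

# Hypothetical skills/web-search/SKILL.md, shown as a string so it can be
# parsed with the helpers defined above. All values are illustrative.
EXAMPLE_SKILL_MD = """\
---
name: web-search
description: Search the web and summarize recent results on a topic.
triggers:
  - "search the web"
  - "find recent news"
  - "look up online"
tools:
  - http_get
  - html_to_text
token_estimate: 800
---
# Web Search Skill

1. Turn the user request into one to three focused search queries.
2. Fetch the top results and summarize them, citing sources.
"""

metadata = parse_skill_frontmatter(EXAMPLE_SKILL_MD)
print(metadata["name"], metadata["triggers"])

# Scanning a real skills/ directory works the same way:
# skills = discover_skills(Path("skills"))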
2. Progressive Disclosure
Not all skill information is needed for every request. Progressive disclosure loads context in stages, adding only what is necessary at each step.
| Stage | Content | Tokens | When Loaded |
|---|---|---|---|
| 1. Metadata | Name, description, triggers | ~50/skill | Always (for selection) |
| 2. Instructions | Full SKILL.md body | ~500–1000 | After skill selected |
| 3. Resources | Examples, templates, schemas | Variable | Only for complex tasks |
from dataclasses import dataclass
from enum import Enum
from pathlib import Path
class DisclosureLevel(Enum):
METADATA = 1 # Name, description, triggers (~50 tokens)
INSTRUCTIONS = 2 # Full SKILL.md body (~500-1000 tokens)
RESOURCES = 3 # Examples, templates (~variable)
@dataclass
class SkillContext:
name: str
level: DisclosureLevel
content: str
token_count: int
class ProgressiveSkillLoader:
def __init__(self, skills_dir: Path):
self.skills_dir = skills_dir
self._metadata_cache: dict[str, dict] = {}
def get_skill_list(self) -> list[dict]:
"""Stage 1: Return minimal metadata for all skills."""
skills = []
for skill_path in self.skills_dir.iterdir():
            if not skill_path.is_dir():
                continue
            if not (skill_path / "SKILL.md").exists():
                continue
            metadata = self._load_metadata(skill_path.name)
            skills.append({
                "name": metadata.get("name", skill_path.name),
                "description": metadata.get("description", "")[:100],  # Truncate
                "triggers": metadata.get("triggers", [])[:5]  # Limit
            })
return skills
def load_instructions(self, skill_name: str) -> SkillContext:
"""Stage 2: Load full instructions on demand."""
skill_file = self.skills_dir / skill_name / "SKILL.md"
content = skill_file.read_text()
body = self._extract_body(content)
return SkillContext(
name=skill_name,
level=DisclosureLevel.INSTRUCTIONS,
content=body,
token_count=self._estimate_tokens(body)
)
def load_resources(self, skill_name: str) -> SkillContext:
"""Stage 3: Load examples and additional resources."""
examples_dir = self.skills_dir / skill_name / "examples"
resources = []
if examples_dir.exists():
for example_file in examples_dir.iterdir():
resources.append(example_file.read_text())
combined = "\n---\n".join(resources)
return SkillContext(
name=skill_name,
level=DisclosureLevel.RESOURCES,
content=combined,
token_count=self._estimate_tokens(combined)
)
def _load_metadata(self, skill_name: str) -> dict:
if skill_name not in self._metadata_cache:
skill_file = self.skills_dir / skill_name / "SKILL.md"
content = skill_file.read_text()
self._metadata_cache[skill_name] = parse_skill_frontmatter(content)
        return self._metadata_cache[skill_name]

    def _extract_body(self, content: str) -> str:
        # Return everything after the closing --- of the YAML frontmatter
        if content.startswith("---"):
            end_idx = content.find("---", 3)
            if end_idx != -1:
                return content[end_idx + 3:].strip()
        return content

    def _estimate_tokens(self, text: str) -> int:
        # Rough heuristic: ~4 characters per token
        return len(text) // 4
# Usage in agent
class SkillAwareAgent:
    def __init__(self, loader: ProgressiveSkillLoader, llm):
        self.loader = loader
        self.llm = llm  # any LLM client exposing select_skill / respond_* helpers
def process(self, query: str) -> str:
# Stage 1: Select skill from metadata
skill_list = self.loader.get_skill_list()
selected = self.llm.select_skill(query, skill_list)
if not selected:
return self.llm.respond_without_skill(query)
# Stage 2: Load instructions
context = self.loader.load_instructions(selected)
# Stage 3: Load examples for complex tasks
        if self.is_complex_task(query):  # app-specific heuristic, not defined here
resources = self.loader.load_resources(selected)
context.content += "\n\n" + resources.content
return self.llm.respond_with_context(query, context.content)
The token savings compound quickly. With 50 skills and the progressive approach, a typical request uses roughly 3,500 tokens instead of 150,000 — a 97.7% reduction. The vector search variant skips the metadata scan entirely, reducing this further to about 3,000 tokens.
Progressive disclosure adds latency through extra LLM calls for skill selection. For time-critical applications, consider pre-loading frequently used skills or using vector search for instant matching.
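One simple mitigation is to keep a few high-traffic skills permanently loaded and only run selection for everything else. The sketch below assumes the ProgressiveSkillLoader defined above; which skills to pin is an application decision, and the names in the usage comment are placeholders.

class PreloadedSkillLoader:
    """Wraps ProgressiveSkillLoader, keeping a handful of hot skills in memory."""

    def __init__(self, loader: ProgressiveSkillLoader, preload: list[str]):
        self.loader = loader
        # Load instructions once at startup for the pinned skills
        self.preloaded = {name: loader.load_instructions(name) for name in preload}

    def instructions_for(self, skill_name: str) -> SkillContext:
        # Pinned skills come from memory; everything else loads on demand
        if skill_name in self.preloaded:
            return self.preloaded[skill_name]
        return self.loader.load_instructions(skill_name)

# Usage (skill names are illustrative):
# hot = PreloadedSkillLoader(loader, preload=["web-search", "data-analysis"])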
3. Database-Backed Tool Discovery
For skill libraries with 50 or more skills, scanning metadata files on every request becomes slow. Vector databases enable instant semantic search: embed each skill’s description and triggers once at index time, then at runtime query with the user’s message to find the closest match without touching the filesystem.
User Query: "help me analyze this spreadsheet"
│
▼
┌───────────────┐
│ Embed Query │
│ (384-dim vec) │
└───────────────┘
│
▼
┌─────────────────────────┐
│ Vector Database │
│ ┌─────────────────┐ │
│ │ data-analysis │●──┼── 0.92 similarity
│ │ visualization │●──┼── 0.78 similarity
│ │ web-search │●──┼── 0.31 similarity
│ │ code-review │●──┼── 0.22 similarity
│ └─────────────────┘ │
└─────────────────────────┘
│
▼
Top match: data-analysis
Load: skills/data-analysis/SKILL.md

from pathlib import Path

import chromadb
from chromadb.utils import embedding_functions
from dataclasses import dataclass
@dataclass
class SkillMatch:
name: str
description: str
score: float
tools: list[str]
class VectorSkillDiscovery:
def __init__(self, persist_dir: str = "./skill_vectors"):
self.client = chromadb.PersistentClient(path=persist_dir)
self.embedding_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
model_name="all-MiniLM-L6-v2"
)
self.collection = self.client.get_or_create_collection(
name="skills",
embedding_function=self.embedding_fn,
metadata={"hnsw:space": "cosine"}
)
def index_skill(self, skill: dict) -> None:
"""Add or update a skill in the vector database."""
text = f"{skill['name']}: {skill['description']}"
if skill.get('triggers'):
text += f" Triggers: {', '.join(skill['triggers'])}"
self.collection.upsert(
ids=[skill['name']],
documents=[text],
metadatas=[{
"name": skill['name'],
"description": skill['description'],
"tools": ",".join(skill.get('tools', [])),
"token_estimate": skill.get('token_estimate', 0)
}]
)
def find_skills(
self,
query: str,
top_k: int = 3,
min_score: float = 0.5
) -> list[SkillMatch]:
"""Find most relevant skills for a query."""
results = self.collection.query(
query_texts=[query],
n_results=top_k,
include=["documents", "metadatas", "distances"]
)
matches = []
for i, distance in enumerate(results['distances'][0]):
score = 1 - distance
if score < min_score:
continue
metadata = results['metadatas'][0][i]
matches.append(SkillMatch(
name=metadata['name'],
description=metadata['description'],
score=score,
tools=metadata['tools'].split(',') if metadata['tools'] else []
))
return matches
    def reindex_all(self, skills_dir: Path) -> int:
"""Reindex all skills from filesystem."""
count = 0
for skill_path in skills_dir.iterdir():
if not skill_path.is_dir():
continue
skill_file = skill_path / "SKILL.md"
if not skill_file.exists():
continue
metadata = parse_skill_frontmatter(skill_file.read_text())
metadata['name'] = skill_path.name
self.index_skill(metadata)
count += 1
return count
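Typical usage wires indexing and lookup together. A short sketch, assuming a local skills/ directory laid out as in the earlier examples:

from pathlib import Path

discovery = VectorSkillDiscovery(persist_dir="./skill_vectors")
indexed = discovery.reindex_all(Path("skills"))
print(f"Indexed {indexed} skills")

# Same query as the diagram above
for match in discovery.find_skills("help me analyze this spreadsheet", top_k=3):
    print(f"{match.name}: {match.score:.2f}")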
| Approach | Pros | Cons | Best For |
|---|---|---|---|
| Keyword/Trigger | Simple, fast, no dependencies | Misses synonyms, brittle | <20 skills |
| LLM Selection | Understands intent | Extra API call, latency | 20–50 skills |
| Vector Search | Semantic matching, fast | Requires embedding model | 50+ skills |
| Hybrid | Best accuracy | Most complex | Production systems |
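The hybrid row combines the strengths of the other approaches. One common arrangement, sketched here rather than prescribed by any particular library, is to try cheap trigger matching against the metadata already in memory and fall back to vector search only when no trigger fires:

def hybrid_find(
    query: str,
    skills: dict[str, SkillMetadata],
    discovery: VectorSkillDiscovery,
    min_score: float = 0.5,
) -> list[str]:
    """Trigger keywords first (fast, free), vector search as a fallback."""
    lowered = query.lower()

    # 1. Exact trigger-phrase matches from metadata already in memory
    trigger_hits = [
        name for name, meta in skills.items()
        if any(trigger.lower() in lowered for trigger in meta.triggers)
    ]
    if trigger_hits:
        return trigger_hits

    # 2. Semantic fallback via the vector index
    return [m.name for m in discovery.find_skills(query, min_score=min_score)]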
Evaluation
Skill discovery quality should be measured with explicit test cases that have ground-truth skill assignments. The key metrics are precision at rank 1 (is the top result correct?) and mean reciprocal rank (how high does the correct skill appear on average?). Latency matters too — vector search should complete in under 50ms.
from dataclasses import dataclass
import time
@dataclass
class TestCase:
query: str
relevant_skills: set[str]
@dataclass
class EvaluationResult:
precision_at_1: float
precision_at_3: float
recall_at_3: float
avg_latency_ms: float
mrr: float # Mean Reciprocal Rank
def evaluate_skill_discovery(
test_cases: list[TestCase],
discovery
) -> EvaluationResult:
"""Evaluate skill discovery accuracy and performance."""
p1_scores, p3_scores, recall_scores = [], [], []
latencies, reciprocal_ranks = [], []
for case in test_cases:
start = time.perf_counter()
results = discovery.find_skills(case.query, top_k=3)
latencies.append((time.perf_counter() - start) * 1000)
result_names = [r.name for r in results]
# Precision@1
        top = result_names[0] if result_names else None
        p1_scores.append(1.0 if top in case.relevant_skills else 0.0)
# Precision@3
hits = sum(1 for r in result_names[:3] if r in case.relevant_skills)
p3_scores.append(hits / 3)
# Recall@3
recall_scores.append(hits / len(case.relevant_skills))
# Mean Reciprocal Rank
for i, name in enumerate(result_names):
if name in case.relevant_skills:
reciprocal_ranks.append(1.0 / (i + 1))
break
else:
reciprocal_ranks.append(0.0)
return EvaluationResult(
precision_at_1=sum(p1_scores) / len(p1_scores),
precision_at_3=sum(p3_scores) / len(p3_scores),
recall_at_3=sum(recall_scores) / len(recall_scores),
avg_latency_ms=sum(latencies) / len(latencies),
mrr=sum(reciprocal_ranks) / len(reciprocal_ranks)
)
| Metric | What it Measures | Target |
|---|---|---|
| Precision@1 | Is the top result the right skill? | >90% |
| Precision@3 | How many of top 3 are relevant? | >80% |
| Mean Reciprocal Rank | How high is the correct skill ranked? | >0.85 |
| Latency | Time to find relevant skill(s) | <50ms |
| False Positive Rate | Skills selected but not relevant | <5% |
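Running the harness only requires a handful of labeled queries. The test queries and skill names below are illustrative, and discovery is the VectorSkillDiscovery instance from the earlier usage example:

test_cases = [
    TestCase(query="summarize recent AI news", relevant_skills={"web-search"}),
    TestCase(query="review this pull request for bugs", relevant_skills={"code-review"}),
    TestCase(query="plot monthly revenue from this CSV", relevant_skills={"data-analysis"}),
]

result = evaluate_skill_discovery(test_cases, discovery)
print(f"P@1={result.precision_at_1:.2f}  MRR={result.mrr:.2f}  "
      f"latency={result.avg_latency_ms:.1f}ms")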
Common Pitfalls
Triggers like “help me” or “do this” match everything. Use specific action verbs and domain terms. Each trigger should be narrow enough to discriminate between skills.
Skills should document when NOT to use them. Without negative examples, a skill may be selected for similar-sounding queries where it is actually inappropriate.
When using vector search, remember to re-embed skills after updates. Implement hash-based change detection so you know when a skill has changed and needs re-indexing; a minimal sketch appears at the end of this section.
Prefer fewer, well-documented skills over many tiny ones. Each skill selection adds cognitive load for the agent and one more opportunity for misclassification.
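A minimal sketch of that change detection: hash each SKILL.md and re-embed only when the digest differs from the stored one. The sidecar hash file and its name are assumptions for the example, not part of the pattern.

import hashlib
import json
from pathlib import Path


def reindex_changed(
    skills_dir: Path,
    discovery: VectorSkillDiscovery,
    hash_file: Path = Path(".skill_hashes.json"),  # hypothetical sidecar file
) -> int:
    """Re-embed only the skills whose SKILL.md content has changed."""
    old_hashes = json.loads(hash_file.read_text()) if hash_file.exists() else {}
    new_hashes, changed = {}, 0

    for skill_path in skills_dir.iterdir():
        skill_file = skill_path / "SKILL.md"
        if not skill_path.is_dir() or not skill_file.exists():
            continue
        content = skill_file.read_text()
        digest = hashlib.sha256(content.encode()).hexdigest()
        new_hashes[skill_path.name] = digest
        if old_hashes.get(skill_path.name) != digest:
            metadata = parse_skill_frontmatter(content)
            metadata["name"] = skill_path.name
            metadata.setdefault("description", "")
            discovery.index_skill(metadata)
            changed += 1

    hash_file.write_text(json.dumps(new_hashes, indent=2))
    return changed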