EvoSkill: Automated Skill Discovery for Multi-Agent Systems
How iterative failure analysis and Pareto-frontier selection can automatically grow and prune an agent's skill library without human curation.
Most multi-agent systems ship with a fixed skill library — a curated set of tools or sub-routines that engineers write by hand and update manually when gaps appear. This works until the task distribution drifts, agents encounter edge cases the original authors didn’t anticipate, or the system is asked to generalize to a new domain. EvoSkill addresses this by treating skill discovery as an automated, ongoing process: agents expose their own failure modes, and those failures seed new candidate skills that are tested and selectively retained.
The Problem with Static Skill Libraries
In the Skills Pattern, agents are given a palette of reusable capabilities — functions, RAG retrievers, code templates, API wrappers — that the orchestrator can compose to solve tasks. The quality of the system’s output is directly bounded by the quality of that palette. When a task arrives that falls outside the existing skill set, the agent either improvises badly or fails entirely.
The conventional fix is human review of failure logs followed by manual skill authoring. This creates a slow, expensive feedback loop that doesn’t scale. Engineers must read traces, recognize patterns across failures, write new skill implementations, test them, and redeploy. In a high-volume production system, the gap between when a new failure pattern emerges and when a skill exists to handle it can be days or weeks.
Automated skill discovery flips this loop: instead of waiting for a human to notice a pattern, the system continuously monitors its own failures and proposes candidate skills to fill them.
Failure Analysis as a Skill Signal
The core insight in EvoSkill is that a failed agent trace is structured information, not just noise. When an agent fails to complete a task, the trace contains the goal, the sequence of tool calls attempted, intermediate results, and the point of breakdown. That information is sufficient to ask: what reusable capability, if it had existed, would have prevented this failure?
This framing turns failure analysis into a generation problem. An LLM examines a batch of failed traces, identifies common patterns — repeated tool call sequences that didn’t converge, missing abstraction layers, recurring data transformation steps — and generates candidate skill implementations. Each candidate is essentially a new tool or sub-agent behavior that abstracts over what the agent was trying to do when it failed.
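As an illustrative sketch (the function name and prompt wording here are ours, not the paper's), the analyzer step can be as simple as packing a batch of same-class failures into one generation prompt for the LLM:

```python
def build_analysis_prompt(traces: list[dict]) -> str:
    """Assemble a batch of same-class failure traces into one analysis prompt.

    Each trace is a dict with 'goal', 'steps' (list of tool-call names),
    and 'failure_point' (index of the step where the run broke down).
    """
    parts = [
        "The following agent runs all failed with the same error class.",
        "Identify the shared failure pattern, then propose ONE reusable",
        "skill (a Python function) that would have prevented these failures.",
        "",
    ]
    for i, trace in enumerate(traces, 1):
        failed_step = trace["steps"][trace["failure_point"]]
        parts.append(f"--- Trace {i} ---")
        parts.append(f"Goal: {trace['goal']}")
        parts.append(f"Steps attempted: {' -> '.join(trace['steps'])}")
        parts.append(f"Broke down at: {failed_step}")
    return "\n".join(parts)

prompt = build_analysis_prompt([
    {"goal": "Summarize quarterly CSV", "steps": ["fetch_file", "parse_csv"],
     "failure_point": 1},
    {"goal": "Merge sales sheets", "steps": ["fetch_file", "parse_csv", "join"],
     "failure_point": 1},
])
```

Showing multiple traces that broke at the same step (here, both at `parse_csv`) is what lets the LLM abstract a pattern rather than overfit to one incident.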
Candidate skills generated from failure traces are hypotheses, not solutions. The framework’s value comes from how it validates and prunes those hypotheses, not from the generation step alone.
This is meaningfully different from prompt engineering or chain-of-thought tuning. Those approaches improve how an agent uses existing skills. Failure-driven skill discovery expands the skill inventory itself.
Pareto-Frontier Skill Retention
Generating candidate skills is cheap; accepting all of them would pollute the skill library with redundant, overlapping, or brittle implementations. EvoSkill applies a Pareto-frontier approach to filter candidates: a new skill is retained only if it improves performance on the validation set without degrading performance on tasks that were previously solved.
This is a multi-objective selection problem. Consider two metrics: coverage (does the skill help on previously failing tasks?) and interference (does adding the skill break anything that worked before?). A skill that improves coverage at zero interference cost is a clear keeper. A skill that improves coverage but introduces regressions fails to dominate the current library and is discarded.
```
Failed Traces
      │
      ▼
┌─────────────────────┐
│  Failure Analyzer   │   (LLM reviews trace batches)
│  (pattern mining)   │
└──────────┬──────────┘
           │  Candidate Skills
           ▼
┌─────────────────────┐
│  Validation Runner  │   (runs candidates on held-out tasks)
│                     │
│   coverage ─────────┤
│   interference ─────┤
└──────────┬──────────┘
           ▼
   ┌───────────────┐
   │ Pareto Filter │
   └───────┬───────┘
           ▼
   ┌───────────────┐
   │ Skill Library │   (retained skills)
   └───────┬───────┘
           │
           └──────►  Agent Pool
```
The Pareto approach also handles the library growth problem. Without a pruning mechanism, a self-evolving skill library accumulates redundancy over time — multiple skills that do similar things in slightly different ways. By requiring Pareto dominance for retention, the framework applies competitive pressure: a new skill that duplicates an existing one without outperforming it on any metric is automatically excluded.
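Using the two metrics above, the dominance check and retention rule can be sketched as follows. This is a minimal illustration under our own assumptions (scalar coverage/interference scores, zero-regression retention); the paper's actual metric definitions may differ:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SkillEval:
    name: str
    coverage: float      # fraction of previously failing tasks now solved
    interference: float  # fraction of previously passing tasks now broken

def dominates(a: SkillEval, b: SkillEval) -> bool:
    """a dominates b: at least as good on both metrics (higher coverage,
    lower interference) and strictly better on at least one."""
    at_least_as_good = a.coverage >= b.coverage and a.interference <= b.interference
    strictly_better = a.coverage > b.coverage or a.interference < b.interference
    return at_least_as_good and strictly_better

def retained(candidates: list[SkillEval]) -> list[SkillEval]:
    """Keep candidates on the Pareto frontier of the pool that also beat the
    do-nothing baseline: some coverage gain, zero regressions."""
    frontier = [c for c in candidates
                if not any(dominates(other, c) for other in candidates)]
    return [c for c in frontier if c.interference == 0.0 and c.coverage > 0.0]

cands = [
    SkillEval("csv_repair",    coverage=0.4, interference=0.0),
    SkillEval("csv_repair_v2", coverage=0.3, interference=0.0),  # dominated
    SkillEval("greedy_retry",  coverage=0.5, interference=0.1),  # regresses
]
kept = retained(cands)  # only csv_repair survives
```

Note that `greedy_retry` sits on the frontier of the candidate pool (best coverage), yet is still rejected because it degrades previously working tasks, which is exactly the no-regression rule described above.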
Engineering a Self-Evolving Skill Loop
Implementing this pattern in a production system requires three infrastructure pieces beyond the agents themselves.
Failure collection and batching. Agents must emit structured failure signals — not just a boolean success/fail, but the full trace with goal, actions, intermediate state, and error classification. A simple schema:
```python
from dataclasses import dataclass

@dataclass
class AgentFailureTrace:
    task_id: str
    goal: str
    steps: list[AgentStep]  # tool calls + results
    failure_point: int      # index into steps
    error_type: str         # classification: tool_gap, reasoning, data_format, etc.
    context: dict           # any additional metadata
```
Traces should be batched by error type before being sent to the analyzer. Batching by type dramatically improves the quality of generated candidates because the LLM sees multiple examples of the same failure pattern rather than a noisy mix.
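A minimal batching step, assuming traces carry the `error_type` classification from the schema above (represented here as plain dicts for brevity), might look like:

```python
from collections import defaultdict

def batch_by_error_type(traces: list[dict], min_batch: int = 3) -> dict[str, list[dict]]:
    """Group failure traces by error classification. Batches smaller than
    `min_batch` are dropped: a lone trace rarely reveals a reusable pattern,
    and small batches waste analyzer calls."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for trace in traces:
        groups[trace["error_type"]].append(trace)
    return {etype: batch for etype, batch in groups.items() if len(batch) >= min_batch}

batches = batch_by_error_type([
    {"task_id": "t1", "error_type": "tool_gap"},
    {"task_id": "t2", "error_type": "tool_gap"},
    {"task_id": "t3", "error_type": "data_format"},
    {"task_id": "t4", "error_type": "tool_gap"},
])
# only the tool_gap batch reaches the analyzer; data_format waits for more examples
```

The `min_batch` threshold is a tuning knob we introduce for illustration: higher values mean cleaner patterns but slower response to new failure modes.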
Candidate generation and sandboxing. Generated skill candidates must be executed in an isolated environment before validation. Skills are code — they can have side effects, make external calls, or contain bugs. Use a sandbox with resource limits and a timeout:
```python
async def evaluate_candidate_skill(
    skill_code: str,
    validation_tasks: list[Task],
    baseline_results: list[TaskResult],  # same tasks run without the candidate
    timeout_seconds: int = 30,
) -> SkillEvalResult:
    # Execute the candidate in an isolated sandbox with resource limits,
    # never in the main agent process.
    sandbox = IsolatedExecutor(memory_limit_mb=512, timeout=timeout_seconds)
    results = await sandbox.run_batch(skill_code, validation_tasks)
    return SkillEvalResult(
        coverage=compute_coverage(results),
        interference=compute_interference(results, baseline_results),
    )
```
Pareto filter and library versioning. The skill library should be versioned so that regressions introduced by a faulty candidate can be rolled back. When a candidate passes the Pareto filter, it’s added to a new library version; agents are migrated to the new version only after a canary period.
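One way to realize this (a sketch; the class and method names are invented for illustration) is an append-only version history where rollback is just a pointer move, so no accepted skill is ever destructively overwritten:

```python
class VersionedSkillLibrary:
    """Append-only version history: each accepted candidate produces a new
    immutable version, and a bad promotion is undone by repointing agents
    at an earlier version."""

    def __init__(self, initial_skills: dict[str, str]):
        self._versions = [dict(initial_skills)]   # list of {skill_name: code}
        self._changelog = ["initial library"]     # audit trail, one entry per version
        self._current = 0

    @property
    def version(self) -> int:
        return self._current

    def skills(self) -> dict[str, str]:
        return dict(self._versions[self._current])

    def add_skill(self, name: str, code: str, reason: str = "") -> int:
        """Create a new version containing the accepted candidate. `reason`
        should record which failure batch triggered the addition."""
        new = dict(self._versions[self._current])
        new[name] = code
        self._versions.append(new)
        self._changelog.append(reason or f"added {name}")
        self._current = len(self._versions) - 1
        return self._current

    def rollback(self, to_version: int) -> None:
        """Repoint agents at an earlier version without losing history."""
        if not 0 <= to_version < len(self._versions):
            raise ValueError(f"unknown version {to_version}")
        self._current = to_version

lib = VersionedSkillLibrary({"fetch_file": "def fetch_file(url): ..."})
lib.add_skill("parse_csv", "def parse_csv(raw): ...", reason="tool_gap batch")
lib.rollback(0)  # canary failed: agents see the pre-promotion library again
```

Because the changelog pairs every version with the reason it was created, the audit question "why does this skill exist?" has a one-line answer.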
Version your skill library the same way you version a dependency: semantic versioning with a changelog. When a self-evolving system adds skills automatically, auditability becomes critical. Engineers need to know why a skill was added and which failure batch triggered it.
When Automated Skill Discovery Makes Sense
This pattern is most valuable in systems where task diversity is high and unpredictable — customer-facing agents that encounter open-ended requests, research agents that operate across domains, or coding agents deployed across heterogeneous codebases. The payoff grows with volume: the more failures a system processes per day, the more signal is available for candidate generation and the faster the skill library can close gaps.
It is less useful in tightly scoped systems with a stable, well-understood task distribution. If your agent handles a narrow, predictable set of requests, manual skill curation is probably sufficient and gives you more control over library quality. Automated discovery also adds operational complexity — you’re now running an inner loop that modifies agent behavior, which requires careful monitoring and rollback capabilities.
The deeper value of the EvoSkill approach is conceptual: it reframes agent skill libraries as living artifacts that should evolve with the agent’s failure distribution, rather than static configurations that engineers update reactively. Building the feedback loop — failure collection, candidate generation, Pareto validation, library versioning — is an investment, but one that pays dividends as the task space expands.
This article is an AI-generated summary. Read the original paper: EvoSkill: Automated Skill Discovery for Multi-Agent Systems.