Skill Creator
Official · by anthropics
Skill Creator is the official Anthropic meta-skill for building, testing, and refining Claude agent skills with the same rigor you would apply to production software. Instead of writing a SKILL.md by hand and hoping it works, Skill Creator walks you through a structured lifecycle: capture what the skill should do, draft the instructions, generate realistic test prompts, run evaluations, collect quantitative benchmarks, and iterate until the skill performs reliably across diverse inputs.

The evaluation system is where this skill earns its keep. You define test prompts and expected outcomes, and Skill Creator spawns independent subagents that execute each eval in a clean context -- no cross-contamination between runs, no token bleed. Every eval runs in parallel in both a with-skill and a baseline (without-skill or previous-version) configuration, producing side-by-side comparisons. A grader agent scores each output against your assertions, an analyzer agent surfaces hidden patterns in the benchmark data, and a comparator agent can perform blind A/B judging between skill versions, so you know which iteration actually performs better, not just which one feels better.

Benchmark reports track pass rates, token usage (mean plus standard deviation), and elapsed time across iterations, giving you the hard numbers to decide whether a change helped or hurt. The built-in eval viewer renders all outputs, grades, and feedback in a browser-based interface where you can review results qualitatively and leave per-eval comments that feed directly into the next improvement cycle.

Skill Creator also handles description optimization -- the often-overlooked step that determines whether Claude actually triggers your skill when it should. It generates 20 realistic test queries (a mix of should-trigger and should-not-trigger), runs them against your current description, and suggests refinements that reduce both false positives and false negatives. Anthropic reported improved triggering accuracy across 5 of 6 public document-creation skills using this approach. With over 10,000 installs and Anthropic-verified status, Skill Creator is the standard starting point for anyone serious about building skills that survive model updates and perform consistently across Claude.ai, Claude Code, and the API.
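The artifact at the center of this lifecycle is the SKILL.md file: YAML frontmatter carrying the skill's name and triggering description, followed by the instructions Claude loads when the skill fires. A rough sketch (the skill name and body text here are illustrative, not an official example):

```markdown
---
name: changelog-writer
description: Drafts release changelogs from commit history. Use when the user asks to summarize commits, write release notes, or prepare a changelog.
---

# Changelog Writer

## Instructions
1. Gather the commit range the user specifies.
2. Group changes into Added / Changed / Fixed sections.
3. Keep each entry to one line; link pull requests where available.
```

Note that the description field does double duty: it is both documentation and the trigger signal Claude matches against incoming requests, which is why Skill Creator treats optimizing it as a first-class step.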
Key Features
- Four operating modes -- Create, Eval, Improve, and Benchmark -- covering the full skill development lifecycle from initial concept to production-ready optimization
- Multi-agent evaluation system that spawns independent subagents to run evals in parallel within clean contexts, eliminating cross-test contamination and accelerating results
- Comparator agents for blind A/B testing between skill versions, with independent judges scoring outputs without knowing which version produced them
- Quantitative benchmark reports tracking pass rates, token usage (mean and standard deviation), and elapsed time across iterations, with delta comparisons between configurations
- Browser-based eval viewer that renders outputs, formal grades, and per-eval feedback in a single interface, with support for iteration-over-iteration comparison
- Description optimization loop that generates realistic trigger and non-trigger queries, tests them against your skill description, and suggests refinements to reduce false positives and false negatives
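Skill Creator's actual report format isn't reproduced here, but the arithmetic behind a benchmark comparison is easy to sketch. The names below (`EvalRun`, `summarize`, `delta`) are illustrative, not the skill's real API:

```python
import statistics
from dataclasses import dataclass

@dataclass
class EvalRun:
    """One eval execution under a single configuration (hypothetical shape)."""
    passed: bool
    tokens: int
    seconds: float

def summarize(runs: list[EvalRun]) -> dict:
    """Aggregate pass rate, token mean/stdev, and total elapsed time."""
    tokens = [r.tokens for r in runs]
    return {
        "pass_rate": sum(r.passed for r in runs) / len(runs),
        "tokens_mean": statistics.mean(tokens),
        "tokens_stdev": statistics.stdev(tokens) if len(tokens) > 1 else 0.0,
        "seconds_total": sum(r.seconds for r in runs),
    }

def delta(with_skill: list[EvalRun], baseline: list[EvalRun]) -> dict:
    """Side-by-side deltas: a positive pass-rate delta and a negative
    token delta suggest the skill version helped."""
    a, b = summarize(with_skill), summarize(baseline)
    return {
        "pass_rate_delta": a["pass_rate"] - b["pass_rate"],
        "tokens_mean_delta": a["tokens_mean"] - b["tokens_mean"],
    }
```

The point of tracking the standard deviation alongside the mean is that a version whose token usage swings wildly between runs may be less production-ready than one with a slightly higher but stable mean.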
Use Cases
- Building a new Claude skill from scratch with structured guidance on SKILL.md anatomy, progressive disclosure patterns, and bundled resource organization
- Regression testing existing skills against new model versions to detect when behavior shifts before it impacts your team's workflows
- Running A/B comparisons between two versions of a skill to determine which performs better on real-world tasks with quantitative evidence
- Optimizing a skill's description field to improve triggering accuracy -- ensuring Claude invokes the skill when it should and stays quiet when it should not
- Iterating on skill quality through repeated eval-feedback-improve cycles until benchmark pass rates, token efficiency, and qualitative output all meet your standards
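The description-optimization use case above reduces to counting two error types over labeled test queries: false positives (the skill fired when it shouldn't have) and false negatives (it stayed quiet when it should have fired). A minimal sketch of that scoring, with a hypothetical function name:

```python
def triggering_errors(results: list[tuple[bool, bool]]) -> dict:
    """Score a batch of description-trigger tests.

    results: (should_trigger, did_trigger) pairs, one per test query --
    the kind of labeled mix Skill Creator generates automatically.
    """
    fp = sum(1 for should, did in results if did and not should)  # fired when it shouldn't
    fn = sum(1 for should, did in results if should and not did)  # silent when it should fire
    return {
        "false_positives": fp,
        "false_negatives": fn,
        "accuracy": 1 - (fp + fn) / len(results),
    }
```

Tracking both error types separately matters: a description can reach high accuracy by never triggering at all, so a refinement that trades a false positive for a false negative is not automatically an improvement.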