One-Line Summary
skill-creator is the meta-tool of the Anthropic Skills ecosystem: it is the Skill for creating Skills.
Core Capabilities
- 🔄 Guides users through the complete Skill creation lifecycle: capture intent → research → write → evaluate → optimize → publish
- 📊 Provides a quantitative evaluation framework that automatically runs test cases and collects results
- 🧪 Supports benchmarking to compare performance differences across iterations
- ✍️ Includes a description optimizer that automatically improves a Skill’s trigger description
- 📦 Ships a built-in packaging tool that bundles a Skill into a distributable format in one step
Trigger Scenarios
Triggers when the user says “create a skill”, “make a skill”, “build a skill”, “optimize skill description”, “evaluate skill”, or similar phrases.
File Inventory
- skill-creator
  - SKILL.md
  - LICENSE.txt
  - scripts
  - agents
  - references
  - eval-viewer
  - assets
Directory Structure Analysis
skill-creator is the most complex Skill in the repository. Its structure embodies the complete “script-driven” Skill pattern:
- SKILL.md: ~200 lines, detailed workflow description covering 7 phases
- scripts/: 8 Python scripts forming a complete toolchain. The evaluation system (run_eval.py + run_loop.py) is the core, supplemented by packaging, validation, and reporting tools
- agents/: 3 sub-agent definitions for analysis, comparison, and grading — the classic pattern of decomposing complex evaluation logic into independent agents
- eval-viewer/: Standalone HTML evaluation viewer, demonstrating “when to build user interfaces”
SKILL.md Structure Analysis
skill-creator’s SKILL.md is ~200 lines long with a clear structural hierarchy:
- Overview (lines 6-26): Describes full workflow from intent capture to publishing
- Communication Guide (lines 32-41): How Claude should interact with users of varying technical levels
- Creation Flow (lines 47-68): Detailed 4-step creation process
- Skill Writing Guide (lines 71+): Skill structure, best practices, triggering tips
YAML Frontmatter Analysis
Module Relationships
The core of skill-creator is the evaluation system, around which scripts and agents are organized:
skill-creator script and agent relationships
graph TD
    SKILL[SKILL.md] -->|guides| Claude
    Claude -->|calls| run_eval[run_eval.py]
    Claude -->|calls| run_loop[run_loop.py]
    Claude -->|calls| quick_validate[quick_validate.py]
    Claude -->|calls| package_skill[package_skill.py]
    Claude -->|calls| improve_description[improve_description.py]
    run_eval --> utils[utils.py]
    run_loop --> utils
    quick_validate --> utils
    run_eval --> generate_report[generate_report.py]
    run_loop --> generate_report
    aggregate_benchmark[aggregate_benchmark.py] --> generate_report
    run_eval --> grader[agents/grader.md]
    run_eval --> analyzer[agents/analyzer.md]
    aggregate_benchmark --> comparator[agents/comparator.md]
    generate_report --> generate_review[eval-viewer/generate_review.py]
    generate_review --> viewer[eval-viewer/viewer.html]
    style SKILL fill:#4fc3f7,stroke:#0288d1,color:#000
    style run_eval fill:#81c784,stroke:#388e3c,color:#000
    style run_loop fill:#81c784,stroke:#388e3c,color:#000
    style generate_report fill:#ffb74d,stroke:#f57c00,color:#000
    style grader fill:#ce93d8,stroke:#7b1fa2,color:#000
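Complete Script Inventory
| Script | Language | Lines | Complexity | Function |
|------|------|------|--------|------|
| run_eval.py | Python | ~120 | ⭐⭐⭐ | Core evaluator: loads the skill, runs test cases, collects results |
| run_loop.py | Python | ~100 | ⭐⭐⭐ | Iteration loop: reruns the evaluation until results converge |
| quick_validate.py | Python | ~45 | ⭐⭐ | Quick validation: checks the skill for basic problems |
| package_skill.py | Python | ~80 | ⭐⭐ | Packager: bundles the skill into a distributable zip archive |
| improve_description.py | Python | ~60 | ⭐⭐ | Description optimizer: analyzes and improves the skill's description field |
| generate_report.py | Python | ~90 | ⭐⭐ | Report generator: summarizes evaluation results into a Markdown report |
| aggregate_benchmark.py | Python | ~70 | ⭐⭐ | Benchmark aggregator: statistical summary across multiple runs |
| utils.py | Python | ~50 | ⭐ | Utility functions: logging, file operations, and other shared helpers |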
run_eval.py: Core Evaluator
run_eval.py is the core of skill-creator. It loads a skill, runs a set of test cases against it, collects Claude’s outputs, and passes the results to the grader agent for scoring.
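The flow described above (load a skill, run test cases, collect outputs, hand them to a grader) can be sketched roughly as follows. This is a minimal illustration, not the actual script: `EvalCase`, `run_eval`, `run_model`, and `grade` are hypothetical names, and the model call and grader agent are injected as plain callables so the sketch runs offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One evaluation prompt plus the behaviour we expect (hypothetical format)."""
    name: str
    prompt: str
    expectation: str


def run_eval(
    skill_text: str,
    cases: list[EvalCase],
    run_model: Callable[[str, str], str],
    grade: Callable[[EvalCase, str], float],
) -> list[dict]:
    """Run every test case against the skill and collect graded results.

    run_model(skill_text, prompt) stands in for the call that produces
    Claude's output; grade(case, output) stands in for the grader agent.
    """
    results = []
    for case in cases:
        output = run_model(skill_text, case.prompt)
        results.append(
            {"case": case.name, "output": output, "score": grade(case, output)}
        )
    return results
```

Keeping the model call and the grader behind callables means the loop itself can be unit-tested with stubs before any real evaluation is run.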
run_loop.py: Iteration Runner
run_loop.py encapsulates the logic of running evaluations repeatedly until results converge or the maximum number of iterations is reached. This is the core driver of the “evaluate-optimize” loop.
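A converge-or-stop loop of this kind might look like the sketch below. The convergence criterion (two consecutive scores within `tolerance`) and the names `run_loop`, `evaluate`, `max_iterations`, and `tolerance` are assumptions made for illustration; the real script may define convergence differently.

```python
from typing import Callable


def run_loop(
    evaluate: Callable[[int], float],
    max_iterations: int = 5,
    tolerance: float = 0.01,
) -> list[float]:
    """Call evaluate(iteration) until the score stops changing or the cap is hit.

    Convergence here means two consecutive scores differ by less than
    `tolerance`; this criterion is an assumption of the sketch.
    """
    scores: list[float] = []
    for iteration in range(max_iterations):
        scores.append(evaluate(iteration))
        if len(scores) >= 2 and abs(scores[-1] - scores[-2]) < tolerance:
            break  # converged: the last two scores are close enough
    return scores
```

The hard cap on iterations matters as much as the convergence test: without it, a noisy evaluator could keep the loop running indefinitely.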
quick_validate.py: Quick Validation
A lightweight validation tool that quickly checks a skill’s basic compliance before a full evaluation.
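As an illustration of what a "basic compliance" check could cover, the sketch below inspects the frontmatter of a SKILL.md text. The specific rules (a `name` field and a `description` of at least 20 characters) and the function name are assumptions, not the checks the real script performs, and the parsing is deliberately simplistic (single-line `key: value` pairs only).

```python
def quick_validate(skill_md: str, min_description_chars: int = 20) -> list[str]:
    """Return a list of problems found in a SKILL.md text (empty means OK)."""
    lines = [line.rstrip() for line in skill_md.splitlines()]
    if not lines or lines[0] != "---":
        return ["missing YAML frontmatter"]
    try:
        end = lines.index("---", 1)  # closing delimiter of the frontmatter
    except ValueError:
        return ["frontmatter is not closed"]

    # Naive "key: value" parsing; enough for flat, single-line fields.
    fields = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()

    problems = []
    if not fields.get("name"):
        problems.append("missing name")
    if len(fields.get("description", "")) < min_description_chars:
        problems.append("description missing or too short")
    return problems
```

Returning a list of problems rather than raising on the first one lets the caller show every issue in a single pass.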
package_skill.py: Packaging Tool
package_skill.py bundles a complete skill directory into a distributable zip archive. It automatically includes SKILL.md and the scripts/, agents/, and references/ directories while excluding __pycache__ and other unnecessary files. The packaged result can be shared directly with other Claude users.
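The packaging step can be approximated with the standard-library `zipfile` module. The sketch below is an assumption-laden illustration: the function name, the `EXCLUDED_DIRS` set, and the choice to nest everything under one top-level folder are not taken from the real script.

```python
import zipfile
from pathlib import Path

EXCLUDED_DIRS = {"__pycache__"}  # assumption: the real exclusion list may be longer


def package_skill(skill_dir: Path, archive_path: Path) -> list[str]:
    """Zip skill_dir into archive_path, skipping excluded directories.

    Returns the archive member names, sorted, for easy inspection.
    """
    skill_dir = Path(skill_dir)
    members = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in skill_dir.rglob("*"):
            relative = path.relative_to(skill_dir)
            if not path.is_file() or EXCLUDED_DIRS & set(relative.parts):
                continue
            # Prefix with the skill folder name so the archive unpacks
            # into a single top-level directory.
            arcname = (Path(skill_dir.name) / relative).as_posix()
            zf.write(path, arcname)
            members.append(arcname)
    return sorted(members)
```

Write the archive outside `skill_dir`; otherwise the walk could pick up the partially written zip itself.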
improve_description.py: Description Optimizer
improve_description.py analyzes the description field in a skill’s YAML frontmatter and improves triggering accuracy by comparing it against actual user conversation patterns. It uses the grader agent to evaluate how well the current description triggers, then generates improved versions. It is the only script that uses an LLM to optimize a skill’s metadata.
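One way to make "triggering accuracy" concrete is to score a description against queries labelled with whether they should trigger the skill. The sketch below is only an illustration of that idea: `trigger_accuracy`, `pick_best`, and the `would_trigger` judge are hypothetical, and in practice the judge would be an LLM call rather than the stub used in testing.

```python
from typing import Callable

# (query, should_trigger) pairs; the format is an assumption of this sketch.
LabelledQuery = tuple[str, bool]


def trigger_accuracy(
    description: str,
    labelled_queries: list[LabelledQuery],
    would_trigger: Callable[[str, str], bool],
) -> float:
    """Fraction of queries where the judge's decision matches the label."""
    correct = sum(
        would_trigger(description, query) == expected
        for query, expected in labelled_queries
    )
    return correct / len(labelled_queries)


def pick_best(
    candidates: list[str],
    labelled_queries: list[LabelledQuery],
    would_trigger: Callable[[str, str], bool],
) -> str:
    """Return the candidate description with the highest trigger accuracy."""
    return max(
        candidates,
        key=lambda d: trigger_accuracy(d, labelled_queries, would_trigger),
    )
```

Including queries that should not trigger the skill keeps an overly broad description from scoring well.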
generate_report.py: Report Generator
generate_report.py takes the JSON output from run_eval.py and summarizes it into a structured Markdown report. The report includes the pass/fail status of each test case, average scores, failure mode categorization, and improvement suggestions. Reports can be embedded directly into Claude’s conversation context to help users understand skill quality.
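A reduced version of the JSON-to-Markdown step might look like this. The input schema (a list of objects with `case` and `score` keys) and the `pass_threshold` parameter are assumptions for illustration; the sketch covers only the pass/fail table and average score, not failure categorization or suggestions.

```python
import json


def generate_report(results_json: str, pass_threshold: float = 0.5) -> str:
    """Render evaluation results (a JSON list of {"case", "score"}) as Markdown."""
    results = json.loads(results_json)
    scores = [r["score"] for r in results]
    average = sum(scores) / len(scores) if scores else 0.0
    lines = [
        "# Evaluation Report",
        "",
        f"Average score: {average:.2f}",
        "",
        "| Test case | Score | Status |",
        "|---|---|---|",
    ]
    for r in results:
        status = "pass" if r["score"] >= pass_threshold else "fail"
        lines.append(f"| {r['case']} | {r['score']:.2f} | {status} |")
    return "\n".join(lines)
```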
aggregate_benchmark.py: Benchmark Aggregation
aggregate_benchmark.py aggregates statistics across multiple evaluation runs to generate benchmark comparisons. It uses the comparator agent to contrast performance across versions and detect regressions. This is critical for ensuring that gains are not lost as a Skill is iterated on.
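The statistical part of this step can be illustrated without any agent involved. In the sketch below, `aggregate_runs`, `detect_regression`, and the fixed `margin` are hypothetical; the real script delegates the comparison to the comparator agent, whereas this version uses a simple mean-difference rule.

```python
from statistics import mean, stdev


def aggregate_runs(scores: list[float]) -> dict:
    """Summarise the scores of repeated runs of one skill version."""
    return {
        "runs": len(scores),
        "mean": mean(scores),
        # The sample standard deviation needs at least two data points.
        "stdev": stdev(scores) if len(scores) > 1 else 0.0,
    }


def detect_regression(
    baseline: list[float], candidate: list[float], margin: float = 0.05
) -> bool:
    """Flag a regression when the candidate mean is more than `margin` below the baseline mean."""
    return mean(candidate) < mean(baseline) - margin
```

The margin keeps ordinary run-to-run noise from being reported as a regression; a stricter approach would compare the difference against the measured standard deviation.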
utils.py: Utility Functions
utils.py is a collection of utility functions shared by all scripts, including log output (log_info / log_error), skill loading (load_skill), test case discovery (find_test_cases), and YAML frontmatter parsing (parse_frontmatter). This “shared utils” pattern demonstrates high cohesion and low coupling: each script focuses on its own logic while common operations are maintained in one place.
Script Relationship Diagram
skill-creator script dependency graph
graph LR
    A[run_eval.py] -->|import| U[utils.py]
    B[run_loop.py] -->|call| A
    B -->|import| U
    C[quick_validate.py] -->|import| U
    D[package_skill.py] -->|import| U
    E[improve_description.py] -->|import| U
    F[generate_report.py] -->|import| U
    G[aggregate_benchmark.py] -->|import| U
    B -->|call| E
    A -->|produces JSON| F
    G -->|produces JSON| F
    A --> J[agents/grader.md]
    A --> H[agents/analyzer.md]
    G --> I[agents/comparator.md]
    style U fill:#ffb74d,stroke:#f57c00,color:#000
    style A fill:#81c784,stroke:#388e3c,color:#000
    style B fill:#81c784,stroke:#388e3c,color:#000
Design Highlights
- Evaluation-Driven Development: The design philosophy is “tests before skill”: every skill optimization is based on quantitative evaluation results
- Agent Decomposition: Complex evaluation logic is split into 3 independent Agents (analyzer / comparator / grader), each focused on one dimension
- Atomic Scripts: Each Python script does one thing (packaging/validation/evaluation/reporting), and scripts are composed via function calls
- Progressive Complexity: run_loop.py runs serially by default but reserves a --parallel parameter, avoiding premature optimization while keeping an extension path open
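The "serial by default, parallel on request" idea can be sketched with `argparse` and `concurrent.futures`. The source only says a parallel option is reserved; treating it as an integer worker count, and the names `build_parser` and `run_cases`, are assumptions of this sketch.

```python
import argparse
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Run evaluations (sketch).")
    # Assumption: the flag takes a worker count; 1 means serial execution.
    parser.add_argument("--parallel", type=int, default=1, metavar="N")
    return parser


def run_cases(cases: Sequence, run_one: Callable, parallel: int = 1) -> list:
    """Run run_one over cases, serially unless parallel > 1."""
    if parallel <= 1:
        return [run_one(case) for case in cases]
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        # Executor.map yields results in input order, like the serial path.
        return list(pool.map(run_one, cases))
```

Because both paths return results in input order, callers do not need to know which mode was used.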
Reusable Patterns
Porting Guide
“If you want to create a skill creation tool for domain XXX…”
- Keep the core evaluation framework (run_eval.py + agents/) — this is the most valuable part
- Replace test case format (from skill tests to your domain’s tests)
- Rewrite the workflow in SKILL.md (from “skill creation” to your domain workflow)
- Keep package_skill.py packaging logic (generic)
- Adjust quick_validate.py check rules as needed
Common Pitfalls
- ⚠️ Don’t write weak descriptions: the description is the only trigger mechanism, so one that is too short or too vague means the skill won’t trigger
- ⚠️ Make Agent instructions specific: Without clear grading criteria in grader.md, evaluation results become unstable
- ⚠️ Always set subprocess timeout: When calling external commands, always set timeout to prevent infinite hangs
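The timeout advice maps directly onto the `timeout` argument of the standard-library `subprocess.run`. The wrapper below is a minimal sketch; the name `run_with_timeout` and its return shape are illustrative choices, not part of skill-creator.

```python
import subprocess
import sys


def run_with_timeout(cmd: list[str], timeout_s: float) -> tuple[bool, str]:
    """Run cmd; return (finished, stdout). finished is False on timeout."""
    try:
        completed = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before re-raising.
        return False, ""
    return True, completed.stdout


if __name__ == "__main__":
    print(run_with_timeout([sys.executable, "-c", "print('done')"], 30))
```

A timed-out command is reported rather than raised here, so an evaluation loop can record the failure and move on to the next test case.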
| Pattern | Description | Applies to... |
|---|---|---|
| Agent Decomposition | Split complex evaluation into independent Agents | Any skill needing multi-dimensional judgment |
| Eval-Driven Iteration | Quantify → auto-improve → re-evaluate | Skills requiring continuous optimization |
| Script Toolchain | Each script has one responsibility, composed via imports | Any script-driven skill |
| HTML Viewer | Python generates HTML, browser for interactive viewing | Skills needing visual results |