skilpel is a Go CLI for evaluating Codex-style skills. It runs eval prompts
with and without a skill, asks a judge model to score the outputs, and turns the
result into local artifacts, terminal feedback, and CI-friendly gates.
Skills are prompts or instructions. skilpel runs the same eval with the skill
enabled and disabled, then compares the judged results. The point is to make the
claim concrete: this skill improves model output by this much, against these
assertions, under these gates.
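
The comparison described above can be sketched in a few lines of Go. This is an illustration only, not skilpel's implementation: the `Gates` type, `passRate`, and `Check` are hypothetical names. It assumes that the pass rate is the fraction of judged assertions that passed, and that the delta is the with-skill pass rate minus the without-skill pass rate.

```go
package main

import "fmt"

// Gates holds two thresholds. Hypothetical type for illustration; the fields
// loosely mirror the --min-pass and --min-delta flags.
type Gates struct {
	MinPass  float64 // minimum pass rate required with the skill enabled
	MinDelta float64 // minimum improvement required over the baseline
}

// passRate returns the fraction of judged assertions that passed.
// An empty result set counts as a pass rate of 0.
func passRate(results []bool) float64 {
	if len(results) == 0 {
		return 0
	}
	passed := 0
	for _, ok := range results {
		if ok {
			passed++
		}
	}
	return float64(passed) / float64(len(results))
}

// Check compares judged results with and without the skill and reports
// the with-skill pass rate, the delta, and whether both gates hold.
func (g Gates) Check(withSkill, withoutSkill []bool) (rate, delta float64, ok bool) {
	rate = passRate(withSkill)
	delta = rate - passRate(withoutSkill)
	ok = rate >= g.MinPass && delta >= g.MinDelta
	return rate, delta, ok
}

func main() {
	g := Gates{MinPass: 0.75, MinDelta: 0.25}
	with := []bool{true, true, true, false}     // 3 of 4 assertions pass
	without := []bool{true, true, false, false} // 2 of 4 assertions pass
	rate, delta, ok := g.Check(with, without)
	fmt.Printf("pass=%.2f delta=%.2f ok=%t\n", rate, delta, ok)
}
```

Keeping the two thresholds separate means a skill can fail a gate either because its output is not good enough in absolute terms or because it does not improve enough on the baseline.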
Skill repositories need a repeatable way to answer a narrow question: did this
skill change model output in the intended direction? skilpel keeps that check
local, explicit, and suitable for CI.
go test ./...
go run ./cmd/skilpel run --root ./skills --skill my-skill --eval-id basic --baseline
Model-backed runs use the provider selected by --provider or provider in a
config file. Run skilpel run --help for the current provider list, default API
key variables, endpoint override rules, and available flags.
OPENAI_API_KEY=... go run ./cmd/skilpel run \
--root ./skills \
--skill my-skill \
--eval-id basic \
--workspace ./.skilpel \
--baseline \
--provider openai \
--target gpt-4o-mini \
--judge gpt-4o-mini \
--min-pass 0.90 \
--min-delta 0.20
For scripts and downstream tooling, keep the final summary on stdout as JSON:
go run ./cmd/skilpel run --config skilpel.yaml --output=json
See CLI output for stdout, stderr, log-file, and exit-code behavior.
skilpel run supports:
- provider plugins for OpenAI, xAI, Qwen, Anthropic/Claude, and Gemini
- per-skill eval files in YAML or JSON
- skill and eval filtering with --skill and --eval-id
- optional without_skill baseline comparison
- pass-rate and baseline-delta gates
- JSON artifacts in a workspace directory
- text, JSON, and Markdown final summaries
- structured or pretty progress logs on stderr
- cmd/skilpel owns the CLI entrypoint.
- internal/skilpel owns skill discovery, prompt construction, provider calls, judging, gates, progress logs, reports, and artifacts.
- Eval files live with the skill they test, usually as <skill>/evals/evals.yaml.
- Run artifacts are written to the configured workspace, usually ./.skilpel.
For downstream CI, install a tagged version rather than tracking a moving branch:
go install github.com/pasunboneleve/skilpel/cmd/skilpel@$SKILPEL_VERSION
Tagged releases also publish prebuilt archives for Linux amd64 and macOS arm64.
go test ./...
- cmd/skilpel/: CLI binary.
- internal/skilpel/: evaluator implementation and tests.
- docs/: focused user documentation.
- CHANGELOG.md: unreleased and release notes.
go run ./cmd/skilpel --help
go run ./cmd/skilpel run --help
go run ./cmd/skilpel version
- CLI output
- Eval files
- Changelog
For a complete repository model with skills and evals kept close together, see
pasunboneleve/oiticica-style.
A scalpel cuts away the excess; a stencil preserves the pattern.
The name points at the work skilpel is meant to do: cut away vague confidence
and preserve the repeatable pattern that makes a skill useful.
skilpel is inspired by
agent-skills-eval and
agentskills.io-style skill layouts. It focuses on the subset needed for fast
local iteration and CI: skill discovery, eval-case filtering, provider-backed
model calls, baseline comparison, and explicit pass/fail thresholds.
skilpel is released under the MIT License.