omk quickstart: benchmark a skill

Get your first report in 5 minutes. Target audience: you have one (or several) skill files and want data to answer "is this skill any good?" or "which of v1 vs v2 is the safer bet?"

Setup (1 minute)

bash
npm i oh-my-knowledge -g
omk --version    # prints a version number once installed

If you want the agent-driven workflow (recommended), also install the omk Agent Skill into your coding agent:

bash
omk install omk-agent-skill

By default, this installs only into detected local targets omk explicitly supports: Codex/AGENTS when ~/.codex or ~/.agents exists, and Claude Code when ~/.claude exists. Use --to all to force every target omk currently knows, or --dest for a custom skill root. Once installed, the agent auto-loads the SKILL context when you mention "omk", "benchmark", "evaluate", "skill eval", etc.
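
To check which targets a default install would likely pick on your machine, the detection rule above can be sketched as a small shell function. This is an assumption-laden sketch, not omk's actual logic: the function name and the printed labels `codex-agents` and `claude-code` are hypothetical, and only the directory checks come from the description above.

```shell
# Sketch only: mirrors the detection rule described above; omk's real logic may differ.
# detect_targets [HOME_DIR] prints one (hypothetical) target label per line.
detect_targets() {
  home_dir="${1:-$HOME}"
  # Codex/AGENTS is selected when either directory exists.
  if [ -d "$home_dir/.codex" ] || [ -d "$home_dir/.agents" ]; then
    echo "codex-agents"
  fi
  # Claude Code is selected when ~/.claude exists.
  if [ -d "$home_dir/.claude" ]; then
    echo "claude-code"
  fi
}

detect_targets "$HOME"
```

If nothing is printed, none of the three directories exists, and --to all or --dest is probably the option you want.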

Prepare your skill (1 minute)

Place skills under skills/. Single .md files are the simplest layout:

skills/
├── my-skill-v1.md
└── my-skill-v2.md

Switch to a directory layout (one folder per version) only when a skill is large enough to need split-out examples or references:

skills/
├── my-skill-v1/
│   └── SKILL.md
└── my-skill-v2/
    ├── SKILL.md
    └── references/     # long examples / reference material
        └── examples.md

Both layouts are recognized; mixing them is fine. For a v1 vs v2 A/B, drop in two versions. To answer "does this skill help at all" with a baseline control, one version is enough.
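
As a starting point, the two layouts above can be scaffolded with a few lines of shell. This is a minimal sketch that creates a mixed layout (v1 as a single file, v2 as a directory); the function name is hypothetical, the file names are taken from the trees above, and the files are created empty for you to fill in.

```shell
# Create a mixed layout: v1 as a single file, v2 as a directory with references.
# scaffold_skills [ROOT] creates skills/ under ROOT (default: current directory).
scaffold_skills() {
  root="${1:-.}"
  mkdir -p "$root/skills/my-skill-v2/references"
  for f in \
    "$root/skills/my-skill-v1.md" \
    "$root/skills/my-skill-v2/SKILL.md" \
    "$root/skills/my-skill-v2/references/examples.md"
  do
    # Never overwrite a file that already exists.
    [ -e "$f" ] || : > "$f"
  done
}
```

Call it as `scaffold_skills .` from your project root; existing files are left untouched.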

Run the eval (3 minutes)

Path A: agent-driven (recommended)

Open Claude Code, cd into your project, and say:

Use omk to benchmark skills/my-skill

The omk skill takes care of the rest: it auto-generates samples if eval-samples.json is missing, runs the eval, and opens the report in your browser. Other useful phrasings:

  • "Use omk to compare skills/my-skill-v1 against skills/my-skill-v2"
  • "Use omk to run a baseline control for skills/audit (with-skill vs without-skill)"
  • "Use omk to batch-evaluate every skill under skills/"
  • "Use omk evolve to auto-improve skills/my-skill over 5 rounds"

Path B: command line

bash
omk sample skills/my-skill.md                       # first time: AI-generate eval samples
omk eval --control v1 --treatment v2 --dry-run      # preview the task plan
omk eval --control v1 --treatment v2                # run for real
omk studio                                          # open the report browser

--dry-run prints expected call count and cost estimates — confirm, drop the flag, and run. omk eval runs a doctor health check as a preflight gate by default; if a skill has structural problems it gets blocked early. Pass --skip-doctor to bypass when you know what you're doing.
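
The preview-then-run habit can be wrapped in a small helper so the real run never happens without a look at the plan first. This is a sketch under stated assumptions: the function name and the `OMK` variable belong to this script, not to omk, and the only omk invocations used are the two `omk eval` commands shown above.

```shell
# Preview first, then run for real only after explicit confirmation.
# OMK is a variable of this script (not an omk feature) so the command can be swapped for a stub.
OMK="${OMK:-omk}"

eval_with_preview() {
  control="$1"
  treatment="$2"
  # Step 1: print the task plan, expected call count, and cost estimate.
  "$OMK" eval --control "$control" --treatment "$treatment" --dry-run || return 1
  # Step 2: ask before spending tokens.
  printf 'Run for real? [y/N] '
  read -r answer
  case "$answer" in
    y|Y) "$OMK" eval --control "$control" --treatment "$treatment" ;;
    *)   echo "Aborted; nothing was run." ;;
  esac
}
```

Usage: `eval_with_preview v1 v2`. If the dry run itself fails, the helper stops before asking.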

Read the report (1 minute)

The browser auto-opens (default http://127.0.0.1:7799/). Look at three things:

Verdict (cross-version conclusion): PROGRESS (better) / NOISE (diff inside the confidence band, undecidable) / REGRESS (worse) / CAUTIOUS (trends look good but confidence is thin) — plus two edge cases, UNDERPOWERED (too few samples to conclude) and SOLO (single variant, nothing to compare against). This is the one-line answer to bring to a review meeting.

Composite score: each variant's average on a 0–5 scale, with a Δ against the control. The 95% confidence interval next to Δ is what decides where the verdict lands.

Low-scoring samples: drill into the ones where the LLM tripped. Compare "rubric expectation" against "actual LLM output" — usually the gap points right back at a specific paragraph in the skill that wasn't clear enough.
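
To build intuition for how the confidence interval around Δ drives the conclusion, here is an illustrative sketch that reads an interval by where it sits relative to zero. It assumes the simplest possible reading and is not omk's verdict logic: the function name is hypothetical, the example bounds are made up, and the thresholds behind CAUTIOUS and UNDERPOWERED are not reproduced.

```shell
# Illustrative only: describe a delta by where its 95% CI sits relative to zero.
# read_ci LOWER UPPER takes the two interval bounds for the delta.
read_ci() {
  awk -v lo="$1" -v hi="$2" 'BEGIN {
    if (lo > 0)      print "interval above zero: treatment looks better"
    else if (hi < 0) print "interval below zero: treatment looks worse"
    else             print "interval spans zero: difference is within noise"
  }'
}

read_ci 0.10 0.62     # prints: interval above zero: treatment looks better
read_ci -0.25 0.31    # prints: interval spans zero: difference is within noise
```

An interval that spans zero is the situation the NOISE verdict describes: the data cannot tell the two variants apart.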

Things to keep in mind

Sample generation takes time. omk asks the AI to generate 10–20 samples per skill by default, but AI-generated samples are biased: they cluster around the happy-path scenarios the skill already documents well, and undersample edge cases, counterexamples, and misuse paths. Spend 30 minutes after the first run filtering: drop the implausible ones, add the missing boundary cases, add a few "intentionally wrong user instructions" to see whether the skill resists being misled. This is the single biggest variable in how much you can trust your numbers.

Evaluation costs LLM tokens. Rough order of magnitude: one sample × one variant ≈ $0.01–0.05, ten samples × two variants ≈ $0.2–1. Always --dry-run first.
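
The arithmetic behind that estimate is samples × variants × per-run cost. A minimal sketch, using the rough $0.01–0.05 per-run range quoted above (an order of magnitude, not a price list; the function name is hypothetical):

```shell
# Rough cost range: samples x variants x per-run cost.
# estimate_cost SAMPLES VARIANTS prints the run count and a low/high dollar estimate.
estimate_cost() {
  samples="$1"
  variants="$2"
  awk -v s="$samples" -v v="$variants" 'BEGIN {
    runs = s * v
    printf "%d runs: $%.2f to $%.2f\n", runs, runs * 0.01, runs * 0.05
  }'
}

estimate_cost 10 2    # prints: 20 runs: $0.20 to $1.00
```

Treat the result as a sanity check only; the --dry-run output is the estimate to rely on.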

Conclusions only generalize to the sample set you wrote. "This version of the skill is better" is bounded by "better on the N samples I designed". Swap the sample set and the conclusion could flip. Sample design is therefore what makes the conclusion credible: don't treat it as a chore to finish before the eval, because it is the eval.

Common task cheat-sheet

| Goal | Natural language | Equivalent command |
| --- | --- | --- |
| First-time scoring of a skill | run a baseline control for skills/X | omk eval --control baseline --treatment X |
| Before/after a skill change | compare git history skills/X against current | omk eval --control git:X --treatment X |
| Auto-improve a skill | use omk evolve on skills/X for 5 rounds | omk evolve skills/X.md --rounds 5 |
| Batch-eval every skill | run eval on every skill under skills/ | omk eval --batch |
| Generate samples only | generate eval samples for skills/X | omk sample skills/X.md |
| Run health-check alone | run a doctor health check on skills/X | omk doctor skills/X.md |
| View past reports | open omk studio | omk studio |

What to bring to the review meeting after your first run

  • Verdict + composite score + Δ + 95% CI band
  • Which samples scored lowest, and where the gap between rubric expectation and LLM output was
  • Which samples you doubt because of how they were designed (it's healthy to "let data speak, then question the data")
  • Doctor health-check result (eval runs it by default; run omk doctor alone if you want to audit the skill's structure separately)

Going deeper