Full-lifecycle evaluation across skill · prompt · RAG · agent

Make every change
backed by statistical evidence

Manage, evaluate, improve, and observe your skills, prompts, RAG, and agent context on one rigorous statistical foundation across the whole journey. Bootstrap confidence intervals and length-debiasing are on by default: not an advanced flag, but a safety net that is always in place.

$ npm i -g oh-my-knowledge

CI passing · MIT license · Node ≥ 22
Same model · same cases · only the knowledge changes

omk — eval · knowledge-carrier evaluation lifecycle live

eval Controlled A/B: same model, same cases, only the knowledge changes — 95% CI on the composite-score lift

✓ v2 clearly beats v1: ship it · CI [+11.2, +25.4] · α = 0.81 · length-debiased

doctor Pre-ship checkup: 7 built-in dimensions scored independently (static rules + LLM audit, offline-capable)

Triggers: Good
Clarity: Good
Precision: Fair
Deps: Good
Tools: Good
Safety: Fair
Examples: Good

🩺 Health: Good · 5 healthy / 2 at risk · 3 suggestions

observe In-prod observation: parse Claude Code sessions to quantify failure rate, cost, and knowledge gaps

Failure rate: 4.2% (▼ −1.6pt vs last week)
P50 latency: 18.4s (▼ −2.1s)
Cost / session: $0.012 (▲ +$0.003)

⚠ Knowledge-gap signal: "missing environment probe upfront" recurs across 12 sessions — ranked #1 by severity weight ×12

Three evaluation capabilities

One pipeline across a skill's whole life

doctor, eval, and observe aren't three tools — they're one measurement discipline at three points in a skill's lifecycle, each answering a different question.

🩺

doctor pre-ship

Is the skill itself well written? 7 built-in dimensions are scored independently by an LLM audit, using repeated sampling with k/n consensus, plus custom dimensions backed by an API endpoint.

$ omk doctor my-skill --dimensions audit.yaml

Triggers / docs / instructions / deps / tools / safety / examples
--repeat: parallel sampling + k/n consensus
Custom endpoint dimensions: call an API for deep review
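
To make the k/n consensus idea concrete, here is a minimal sketch in Python. It assumes each repeated sample yields one rating label for a dimension, and that a rating is accepted only when at least k of the n samples agree. The function name `consensus` and the "Poor" label are hypothetical, not part of omk.

```python
from collections import Counter

def consensus(ratings, k):
    """Return the rating that at least k of the n samples agree on, else None."""
    if not ratings:
        return None
    # most_common(1) gives the single most frequent label and its vote count.
    label, votes = Counter(ratings).most_common(1)[0]
    return label if votes >= k else None

# Hypothetical ratings from five repeated audits of one dimension.
samples = ["Good", "Good", "Fair", "Good", "Good"]
print(consensus(samples, k=4))                   # four of five samples agree
print(consensus(["Good", "Fair", "Poor"], k=2))  # no rating reaches k
```

Returning None instead of a forced majority keeps an unstable rating visible rather than hiding it behind a coin flip.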

📊

eval on release

Is v2 really better than v1? A controlled A/B — same model, same cases, only the knowledge changes. Six dimensions scored independently, one-line verdict with a ship recommendation.

$ omk eval --control v1 --treatment v2

Bootstrap CI / length-debias / saturation curves on by default
Krippendorff α: judge↔human agreement on a gold set
Blind A/B · judge ensemble · multi-run variance

🔭

observe in prod

How is it doing in production? Parse real session JSONL to measure each skill's failure rate, latency, and token cost, and surface severity-weighted knowledge-gap signals.

$ omk observe ~/.claude/sessions

Failure rate / latency / cost broken down per skill
Knowledge-gap detection: quantify risk exposure
Feeds production signal into sample / evolve iteration

Moat · measurement credibility

Rigor is the foundation, not an add-on

Five often-overlooked distortions decide whether a comparison is trustworthy. omk builds every defense into the foundation, so you don't have to enable them one by one.

A point estimate mistakes sampling noise for a real gain

Bootstrap confidence intervals built-in

Reports an interval, not a point — significance is read off directly.
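
As an illustration of reading significance off an interval, the sketch below computes a percentile bootstrap CI over per-case score lifts. It is one common bootstrap variant under stated assumptions, not necessarily the exact procedure omk runs. The function name, the resample count, and the sample data are all hypothetical.

```python
import random
import statistics

def bootstrap_ci(diffs, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap CI for the mean of paired score differences."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(diffs)
    # Resample the cases with replacement and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    tail = (1.0 - confidence) / 2.0
    lo = means[int(tail * n_resamples)]
    hi = means[int((1.0 - tail) * n_resamples) - 1]
    return lo, hi

# Hypothetical per-case lifts: treatment score minus control score.
lifts = [12.0, 20.5, 15.0, 22.0, 9.5, 18.0, 25.0, 14.5, 17.0, 21.0]
low, high = bootstrap_ci(lifts)
# An interval that excludes zero indicates a lift that is unlikely to be noise.
verdict = "significant" if low > 0 or high < 0 else "inconclusive"
print(f"95% CI [{low:+.1f}, {high:+.1f}] -> {verdict}")
```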

A composite average hides a regression in a single dimension

Three-layer independent scoring · pass-all gate built-in

Fail any of fact / behavior / judge and it doesn't pass.
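
A small sketch of why a pass-all gate catches what an average hides. The three layer names come from the text above; the 0 to 1 score scale, the 0.7 threshold, and the `LayerScores` type are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class LayerScores:
    fact: float
    behavior: float
    judge: float

def passes_gate(scores, threshold=0.7):
    """A case passes only if every layer clears the threshold on its own."""
    return all(
        value >= threshold
        for value in (scores.fact, scores.behavior, scores.judge)
    )

# Hypothetical case: strong on two layers, weak on the judge layer.
case = LayerScores(fact=0.95, behavior=0.90, judge=0.55)
average = (case.fact + case.behavior + case.judge) / 3
print(f"average={average:.2f}, passes={passes_gate(case)}")
```

Here the composite average clears the threshold while the gate still rejects the case, because the judge layer fails on its own.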

The control group reads the very carrier under test, so construct validity breaks down

Strict-baseline isolation built-in

Closes three leaks: skill self-discovery, the Skill tool, and the cwd bypass.

Judges systematically prefer longer answers

Length-debiased judging built-in

Scoring removes the length covariate — verbosity no longer buys points.
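
One simple way to remove a length covariate is to regress judge scores on answer length and subtract the fitted length effect. The sketch below does that with ordinary least squares. This is an assumed approach for illustration only: the page does not specify how omk debiases. The function name and sample data are hypothetical.

```python
import statistics

def debias_by_length(scores, lengths):
    """Remove the linear length effect from judge scores (OLS residualisation)."""
    mean_len = statistics.fmean(lengths)
    mean_score = statistics.fmean(scores)
    var_len = sum((x - mean_len) ** 2 for x in lengths)
    if var_len == 0:
        return list(scores)  # all answers equally long: nothing to correct
    slope = sum(
        (x - mean_len) * (y - mean_score) for x, y in zip(lengths, scores)
    ) / var_len
    # Shift every score to what it would be at the average length.
    return [y - slope * (x - mean_len) for x, y in zip(lengths, scores)]

# Hypothetical data where the score rises purely with token count.
lengths = [100, 200, 300, 400]
scores = [6.0, 7.0, 8.0, 9.0]
adjusted = debias_by_length(scores, lengths)
print([round(s, 2) for s in adjusted])
```

In this toy data the score difference is explained entirely by length, so the adjusted scores collapse to the same value and verbosity earns nothing.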

The reliability of the judge's own scoring goes unmeasured

Krippendorff α, enabled with a gold set

Anchored on human labels, it quantifies judge↔human agreement.
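
For readers who want to see what the agreement number measures, here is a minimal sketch of Krippendorff's α for the simplest case: two raters (judge and human), interval-scale scores, and no missing ratings. The function name and the gold-set scores are hypothetical, and omk's own implementation may handle more raters, other scales, or missing data.

```python
from itertools import combinations

def krippendorff_alpha_interval(judge, human):
    """Krippendorff's alpha for two raters, interval scale, no missing ratings."""
    if len(judge) != len(human) or len(judge) < 2:
        raise ValueError("need two equally long rating lists with at least 2 items")
    n_values = 2 * len(judge)
    # Observed disagreement: squared judge-vs-human gap, averaged over items.
    d_obs = sum((j - h) ** 2 for j, h in zip(judge, human)) / len(judge)
    # Expected disagreement: squared gap over all pairs of pooled ratings.
    pooled = list(judge) + list(human)
    pair_sum = sum((a - b) ** 2 for a, b in combinations(pooled, 2))
    d_exp = 2 * pair_sum / (n_values * (n_values - 1))
    if d_exp == 0:
        raise ValueError("alpha is undefined when all ratings are identical")
    return 1.0 - d_obs / d_exp

# Hypothetical gold set: judge scores and human labels for five cases.
judge_scores = [4, 3, 5, 2, 4]
human_scores = [4, 3, 4, 2, 5]
alpha = krippendorff_alpha_interval(judge_scores, human_scores)
print(f"alpha = {alpha:.2f}")
```

An α of 1 means perfect agreement and values near 0 mean agreement no better than chance, so the number shows how far the judge can be trusted as a stand-in for human labels.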

Peer tools typically cover only one or two of these. omk's choice: build credibility into the foundation rather than leave it optional.

Tool comparison

A side-by-side under one set of criteria

Criteria come from common LLMOps selection axes (metric library / judge / CI / observability / collaboration) plus measurement validity & reliability — not rules tailored to omk. On several axes omk doesn't win, and we mark that honestly.

Capability	omk	promptfoo	DeepEval	LangSmith
Measurement credibility · validity / reliability
Statistical significance (CI / tests)	✓ Bootstrap	—	—	—
Judge ↔ human reliability (agreement)	✓ Krippendorff α	—	—	—
Evaluation bias control (length-debias)	✓ default	—	—	—
Evaluation capability
Assertion / metric library breadth	✓ 30+	✓	✓	◑
RAG-specific metrics	◑ 3	◑	✓ rich	◑
LLM-as-judge	✓	✓	✓	✓
Engineering & collaboration
CI/CD integration (exit-code routing)	✓	✓	✓	◑
Onboarding speed / config simplicity	◑	✓ very fast	◑	◑
Experiment tracking / tracing	—	—	◑	✓ strong
Hosted SaaS dashboard / team collab	—	—	✓	✓
Ecosystem & integration
Community size (GitHub stars, 2026-04)	nascent	9k+	12k+	commercial
Native Claude Code skill	✓	—	—	—

✓ Native · ◑ Partial / needs config · — Not supported

Full comparison (8 tools × 30+ dimensions, incl. RAGAS / OpenAI Evals / lm-eval-harness / inspect-ai) is in the comparison doc, current as of 2026-04 — spot something stale? send a PR. Takeaway: no silver bullet — omk's tradeoff is "statistical credibility by default"; want a SaaS dashboard, pick LangSmith; want academic benchmarks, pick lm-eval-harness.

$ omk install omk-agent-skill # one-time install

✓ Wrote to detected Claude Code / Codex

/omk eval # evaluate this project's artifact

/omk evolve # one shot: checkup → samples → self-iterate

/omk sample # generate or top up eval cases

> Compare v1 and v2 for me

↳ infer intent → omk eval --control v1 --treatment v2 …

Agent integration

Use it right inside your coding agent

One line — omk install omk-agent-skill — installs the official Agent Skill into the Claude Code / Codex it detects locally (--to all writes to all). After that, /omk works out of the box in Claude Code; in Codex and others, just run the omk CLI.

No commands to memorize — state your goal in plain words, and the agent locates the skill from context and picks the right command.

You

Make this skill more robust, then compare it against the last version

Before your next release, let the data speak first.

Run your first eval in 5 minutes

From a bare number to a conclusion that holds up

No files to touch — omk init scaffolds two skill versions and three cases, and omk eval produces an HTML report plus a one-line verdict in under 5 minutes.

$ omk init demo && cd demo && omk eval
