Glossary

omk docs (blog posts, SKILL.md, CLI output, report pages) freely mix industry-standard ML, statistics, and measurement terms. These terms are de facto standard in the English-speaking community, so the tables below serve as a quick-reference index: each entry gives a one-line definition and, where applicable, where the term shows up in omk.

Scope: a reader's cheat sheet, not a design spec. omk maintainers follow this vocabulary when writing new docs.

Sibling docs: terminology spec (maintainer-internal decision record) / statistical rigor / composite-score construct validity


1. Statistics / measurement

| Term | One-line definition | Where it shows up in omk |
| --- | --- | --- |
| bootstrap CI | Distribution-free 95% confidence interval, computed by resampling (1000 iterations by default) | omk eval --bootstrap; the "paired comparison" table on the report page |
| Δ (delta) | Mean difference in composite score between treatment and control | hero "Δ +2.778"; paired comparison table |
| 95% CI | If the evaluation were repeated many times, about 95% of intervals built this way would contain the true mean. A CI that excludes 0 indicates a significant difference | hero tooltip; paired comparison table |
| significant | CI excludes 0 (the gap is unlikely to be due to chance alone) | reliability check ✓ significant-difference badge |
| Pearson r | Pearson correlation coefficient. 1 = perfectly aligned / 0 = no linear relationship / -1 = perfectly opposed | "cross-sample judge agreement" table for the multi-judge ensemble |
| MAD | Mean absolute deviation. Average distance among judges scoring the same sample. On a 1-5 scale, < 0.5 is tight agreement, > 1.5 is large disagreement | multi-judge agreement table |
| Krippendorff α | Ordinal-weighted multi-judge agreement. α ≥ 0.8 high agreement / 0.667-0.8 acceptable / < 0.4 low | Human gold section |
| p-value | Probability of seeing a gap at least this large if there were no true difference; smaller means stronger evidence (0.05 is the usual threshold) | t-test section (not omk's primary path; bootstrap takes priority) |
| effect size | The gap relative to the noise (Cohen's d / Hedges' g), putting a scale on "how big the difference is" | Cohen's d column in the variance / significance table |
| CV | Coefficient of variation, stddev / mean, a stability metric. On a 1-5 scale, < 5% stable / 5-15% medium / > 15% unstable | stability column + hero tooltip |
| stddev (σ) | Standard deviation, a measure of how much values fluctuate around their mean | stability column |
| saturation | The point where adding more samples no longer changes the conclusion (CI-width shrinkage flattens out) | reliability check "✓ saturated" badge |
| holdout (set) | Independent validation samples the skill never explicitly covered, used to guard against sample-set overfitting | post-evaluation follow-up recommendations |
| construct validity | Whether the measurement actually measures the intended thing (vs measurement error) | scoring.md: composite-score construct-validity argument |
| ad hoc | An implementation choice made without a principled justification, typically "ship it first, justify later" | scoring.md: equal-weight composite aggregation is ad hoc |
| sample-set overfitting | The evaluation samples happen to be ones the skill has "already answered," which inflates scores | scoring.md / evaluation blog caveat section |
| length debias | Correction for the known LLM-judge bias of scoring longer answers higher | on by default; disable with omk eval --no-debias-length |
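
Several of the entries above (Δ, bootstrap CI, significant, effect size, CV) fit together in one short calculation. The following is a minimal sketch, not omk's implementation: it assumes a percentile bootstrap over paired per-sample differences and the paired form of Cohen's d, and the scores, function names, and seed are all hypothetical.

```python
import random
import statistics


def bootstrap_ci(diffs, iterations=1000, level=0.95, seed=0):
    """Percentile bootstrap CI for the mean of paired differences."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(diffs)
    # Resample the differences with replacement and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(iterations)
    )
    lo_idx = round((1 - level) / 2 * iterations)
    hi_idx = round((1 + level) / 2 * iterations) - 1
    return means[lo_idx], means[hi_idx]


def cohens_d(diffs):
    """Paired effect size: mean difference divided by the stddev of the differences."""
    return statistics.fmean(diffs) / statistics.stdev(diffs)


def coefficient_of_variation(scores):
    """CV = stddev / mean."""
    return statistics.stdev(scores) / statistics.fmean(scores)


# Hypothetical composite scores (1-5 scale) for the same samples in both arms.
control = [2.9, 3.1, 3.4, 2.8, 3.0, 3.3, 2.7, 3.2]
treatment = [3.8, 3.6, 4.1, 3.5, 3.9, 4.0, 3.4, 3.7]
diffs = [t - c for t, c in zip(treatment, control)]

delta = statistics.fmean(diffs)
ci_low, ci_high = bootstrap_ci(diffs)
significant = ci_low > 0 or ci_high < 0  # "significant" means the CI excludes 0

print(f"delta={delta:+.3f}  95% CI=[{ci_low:.3f}, {ci_high:.3f}]  significant={significant}")
print(f"Cohen's d={cohens_d(diffs):.2f}  CV(treatment)={coefficient_of_variation(treatment):.1%}")
```

Because the samples are paired, resampling the differences rather than the two arms separately keeps each sample's control and treatment scores together.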

2. omk evaluation concepts

| Term | One-line definition | Where it shows up in omk |
| --- | --- | --- |
| artifact | The unified abstraction for omk's "thing under evaluation": skill / prompt / agent / workflow / baseline | determined by experiment role (--control / --treatment / baseline), not a standalone flag |
| executor | How the model is run: claude / codex / openai-api / gemini | --executor parameter; execution-environment fingerprint |
| ensemble (judge) | Multiple LLMs act as judges and score independently; their scores are then combined | --judge-models claude:opus,claude:sonnet |
| judge | An LLM scoring against a rubric | judge model parameter; evidence table |
| rubric | The detailed criteria a judge follows when scoring (must recognize X / must include Y / at least N items / ...) | rubric field in sample config |
| anchor | A method for calibrating the LLM judge against human standards | --gold-dir human anchors |
| gate (layer gate) | Three independent layer significance tests (fact / behavior / judge); a regression in any layer triggers CAUTIOUS+ | verdict algorithm; "variance / significance" table on the report page |
| verdict | One of six tiers: PROGRESS / REGRESS / CAUTIOUS / NOISE / UNDERPOWERED / SOLO | hero badge; CLI verdict output |
| sample (evaluation sample) | A single evaluation case | eval-samples.json |
| eval-samples | The sample config file (each entry has prompt / rubric / assertion / capability) | omk eval --samples |
| baseline (reserved variant) | The control group with no skill injected; omk reserves this variant name | --control baseline |
| treatment | The experiment group with the skill injected | --treatment <name> |
| control | An alias for baseline | --control <name> |
| composite (score) | Equal-weight mean of the fact / behavior / judge layers on a 1-5 scale | first column of the six-dimension comparison table |
| fact (layer) | Assertion pass rate mapped to 1-5 via 1 + ratio*4 | "📋 Fact" in the six-dimension comparison table |
| behavior (layer) | Pass rate of process-level assertions (tool calls / turns / cost caps) | "🛠️ Behavior" in the six-dimension comparison table |
| judge (layer) | The 1-5 score the judge gives directly against the rubric | "💬 LLM judge" in the six-dimension comparison table |
| dimension | Capability-aligned scoring dimension (not part of composite) | five-layer scoring pipeline architecture |
| reliability check | Four evidence blocks (judge agreement / significant difference / saturation / human alignment), collapsible on the report page | details block on the report page |
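
The fact and composite entries above describe simple arithmetic, which a short example may make concrete. This is a sketch under stated assumptions rather than omk's code: the function names are hypothetical, and applying the same 1 + ratio*4 mapping to the behavior layer is an assumption, since the table only says that layer is a pass rate.

```python
from statistics import fmean


def pass_rate_to_score(passed: int, total: int) -> float:
    """Map an assertion pass rate in [0, 1] onto the 1-5 scale via 1 + ratio * 4."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 1 + (passed / total) * 4


def composite(fact: float, behavior: float, judge: float) -> float:
    """Equal-weight mean of the three layer scores, each on a 1-5 scale."""
    return fmean([fact, behavior, judge])


fact = pass_rate_to_score(passed=3, total=4)      # 1 + 0.75 * 4 = 4.0
behavior = pass_rate_to_score(passed=1, total=2)  # assumed to share the mapping: 3.0
judge = 4.5                                       # hypothetical rubric score from the judge

print(f"composite = {composite(fact, behavior, judge):.3f}")
```

With this mapping, a layer where every assertion fails scores 1 and a layer where every assertion passes scores 5, so all three layers share one scale before averaging.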

3. Machine learning / AI general

| Term | One-line definition |
| --- | --- |
| prompt | The input text given to an LLM |
| system prompt | Background instructions injected before user input; for evaluation omk injects the entire SKILL.md as the system prompt |
| agent | An AI that can call tools and run over multiple turns |
| workflow | A multi-step AI process orchestration |
| skill | One of omk's core evaluation targets, usually in the form of a SKILL.md |
| tool call | An invocation of an external function by the LLM during execution |
| turn | One interaction unit of "LLM output + user/tool response" |
| context | All the history the LLM sees while generating |
| fingerprint | A version hash taken from the leading characters of a SHA-256 digest (12 chars by default), used to verify consistency across runs |
| session trace | The event stream of one complete AI conversation (prompt / tool calls / output / scoring), the object observe parses |
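
The fingerprint entry above can be illustrated with Python's standard hashlib. This is a minimal sketch: the 12-character SHA-256 prefix comes from the definition, while the function name and the choice of hashing a skill file's raw bytes are assumptions made for illustration.

```python
import hashlib


def fingerprint(content: bytes, length: int = 12) -> str:
    """Return the first `length` hex characters of the SHA-256 digest of `content`."""
    return hashlib.sha256(content).hexdigest()[:length]


# Hypothetical skill contents; any byte string works.
skill_v1 = b"# SKILL\nAlways cite sources.\n"
skill_v2 = b"# SKILL\nAlways cite sources and dates.\n"

print(fingerprint(skill_v1))
print(fingerprint(skill_v1) == fingerprint(skill_v1))  # same bytes, same fingerprint
print(fingerprint(skill_v1) == fingerprint(skill_v2))  # any edit changes the fingerprint
```

A 12-character hex prefix carries 48 bits, which is short enough to read in CLI output while making accidental collisions between a handful of versions very unlikely.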

4. omk's three stages

omk's loop runs in three stages: doctor (preflight health) → eval (offline A/B + verdict) → observe (production traces). Together they cover knowledge evaluation, management, and insight. See the three stages for the full mental model.


Writing conventions

The terms above are the shared vocabulary for omk docs. Detailed naming decisions for omk-internal terms (artifact / executor / variant / verdict, etc.) live in the terminology spec (a maintainer archive). The Chinese docs additionally follow GB/T 15834 punctuation and a set of translation rules, documented in the Chinese glossary; they don't apply to English prose.