omk terminology spec
Scope: This is a naming-decisions archive for omk maintainers (why artifact instead of evaluand, why --variants was dropped from v0.16, the qualityScore → judgeScore migration path, etc.). It is not a getting-started doc; for everyday usage see the README. The source code is the canonical reference, since the key terms are all English anyway.
1. Goals
This spec unifies the user-facing copy, command examples, data structures, and code naming used across omk's ongoing iterations.
Three goals:
- Align with industry and open-source conventions, minimizing omk-private jargon.
- Separate the four layers — "thing being evaluated", "runtime environment", "experiment grouping", and "experiment role" — so they don't get conflated.
- Keep a single abstraction that extends to future carriers: skill, agent, workflow, agent team, and beyond.
2. Standard terms
1. Artifact
artifact is omk's standard term for "the thing being evaluated".
It is the object that gets compared, injected, run, or observed in an experiment. It can be:
- baseline
- skill
- prompt
- agent
- workflow
- a future team or other new kind of knowledge carrier
Rules:
- Prefer artifact in user-facing docs.
- Prefer artifact in core internal types, request structures, and task structures.
2. Artifact kind
artifact kind is the concrete category of an artifact.
Currently supported:
- baseline
- skill
- prompt
- agent
- workflow
Rules:
- baseline is the empty artifact: no explicit artifact is injected. For most users it just means "nothing at all".
- skill, agent, and workflow are subtypes of artifact, not the top-level umbrella term.
- When adding a new carrier, extend artifact kind rather than spinning up a parallel abstraction.
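As a rough illustration of the rules above, the kind can be modeled as one closed union that new carriers extend. This is a minimal sketch, not omk's actual source: the five kind values come from this spec, while the Artifact shape and both helper functions are hypothetical.

```typescript
// Sketch only: the five kind values come from this spec; everything else is assumed.
const ARTIFACT_KINDS = ['baseline', 'skill', 'prompt', 'agent', 'workflow'] as const;
type ArtifactKind = (typeof ARTIFACT_KINDS)[number];

// Hypothetical shape; the real Artifact type may carry more fields.
interface Artifact {
  name: string;
  kind: ArtifactKind;
}

// Runtime guard for values read from config files or CLI input.
function isArtifactKind(value: string): value is ArtifactKind {
  return (ARTIFACT_KINDS as readonly string[]).indexOf(value) !== -1;
}

// baseline is the empty artifact: nothing is injected.
function isEmptyArtifact(artifact: Artifact): boolean {
  return artifact.kind === 'baseline';
}
```

Adding a future carrier such as team would then mean appending one value to ARTIFACT_KINDS rather than introducing a second, parallel type.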
3. Variant
A variant is the expression of one comparison arm in an experiment, not the domain object itself.
For example:
- baseline
- prd
- /path/to/SKILL.md (the runtime context cwd is declared separately, not encoded into the expression)
Rules:
- Resolving a variant expression yields an artifact plus a runtime context.
- Every variant must be bound to an experiment role (control or treatment); see section 4.
- The CLI declares variants by experiment role (--control / --treatment); the flat --variants parameter is no longer used.
4. Experiment role
experiment role is the role a variant plays in a given experiment, using standard statistical terminology.
Enum:
- control: the control group, providing the baseline measurement.
- treatment: the treatment (experimental) group, compared against control to see what changes.
Rules:
- Role is a run-time property of a variant, not an intrinsic property of the artifact; the same artifact can play different roles across runs.
- The CLI declares it via two separate parameters, --control <expr> and --treatment <v1,v2,...>.
- Reports display control/treatment labels; the role is no longer inferred back from artifactKind === 'baseline'.
- baseline is an artifact-kind term, not an experiment-role term; see the boundaries in section 3.
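The role rules above can be sketched as a small mapping from the two CLI parameters to role-tagged variants. This is an illustrative sketch under assumptions: the function name, the expr field, and the comma-splitting behavior are hypothetical and not taken from omk's parser.

```typescript
type ExperimentRole = 'control' | 'treatment';

// Hypothetical minimal shape: only experimentRole is named in this spec.
interface VariantConfig {
  expr: string;
  experimentRole: ExperimentRole;
}

// controlExpr is the value of --control; treatmentList is the raw value of
// --treatment, assumed here to be a comma-separated list of expressions.
function variantsFromFlags(controlExpr: string, treatmentList: string): VariantConfig[] {
  const treatments = treatmentList
    .split(',')
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
  return [
    { expr: controlExpr, experimentRole: 'control' },
    ...treatments.map((expr): VariantConfig => ({ expr, experimentRole: 'treatment' })),
  ];
}

// Example: role is assigned at declaration time, never derived from the kind.
const declared = variantsFromFlags('baseline', 'skill-v1, skill-v2');
```

The point of the sketch is that each variant carries its role from the moment it is declared, so nothing downstream has to infer it.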
5. Runtime context
runtime context is the run-time environment; the most central piece today is cwd.
It is the environment the model or agent runs in, as opposed to "the thing being evaluated" itself.
In project-style agent scenarios, the runtime context directly includes the environmental factors that affect behavior:
- the project directory
- CLAUDE.md
- local skills
- repo files
- the tool-visibility scope
Rules:
- cwd belongs to the runtime context and is declared separately (the CLI's --control-cwd / --treatment-cwd, or eval.yaml's structured cwd: field); it is not encoded into the variant expression.
- To express "empty artifact + a specific runtime context", use a self-describing label as the artifact and supply the cwd separately, e.g. --treatment project-env --treatment-cwd /path/to/project.
- Do not collapse the project directory, project-level runtime context, and explicit artifact injection into a single concept.
6. Sample
A sample is one test-case record in the evaluation.
Rules:
- Code / API / file names / CLI flags keep sample: the Sample type, the sample_id field, the eval-samples.json file name, the --samples flag. These are common terms across the open-source API and the English-speaking LLM-eval world, and stay as-is.
- User-facing Chinese copy defaults to「用例」, not「样本」: CLI output, report UI, error messages, doc prose, and the Chinese part of commit messages. This includes compounds like「用例数」/「用例难度」/「用例不足」/「跨用例散度」.
- Rationale: omk's eval-samples are test cases hand-picked by developers, not statistical samples randomly drawn from some distribution.「样本」implies "just run more and the sample size grows", which misleads users: what they actually need is more design and more cases.「用例」matches the engineering framing (test case) and the user's mental model when writing an evaluation ("I designed 5 cases").
- Exception: keep「样本」in statistical-terminology contexts: the Cohen's d / Hedges' g small-sample correction, sample mean, sample variance, sample size, bootstrap resampling, etc. These are fixed phrasings in statistics; forcing them into「用例」would just make a stats-literate reader pause. Decision rule: does the word denote "one random draw from a population" (the statistical concept, so 样本) or "one hand-picked test case from a developer" (so 用例)? The two don't mix, and context makes it clear.
6.1 Sample metadata fields
The Sample schema has 4 optional metadata fields, purely for documentation / diagnostics; they do not participate in grading / judge / verdict. See docs/specs/sample-design-spec.md.
- capability?: string[]. The capability dimension(s) this sample tests (can be multiple). Normalized case-insensitively, with dash / camelCase / underscore insensitivity.
- difficulty?: 'easy' | 'medium' | 'hard'. Difficulty bucket (strict enum).
- construct?: string. The construct type this sample tests. Suggested: 'necessity' (tests necessity, baseline-vs-skill) / 'quality' (tests whether the skill is well-written) / 'capability' (tests a specific capability). Free-form string allows custom values.
- provenance?: 'human' | 'llm-generated' | 'production-trace'. Data source.
construct vs. capability (the two fields users most often confuse):
- construct = what class of thing this sample tests (necessity / quality / capability). It's the experiment-design level — running baseline-vs-skill tests necessity, running skill-v1-vs-skill-v2 tests quality.
- capability = which specific capabilities this sample tests (api-selection / error-diagnosis / fallback). It's the capability dimension of the object under test.
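The four metadata fields and the capability normalization described in 6.1 might look roughly like this. The field names and enum values come from this spec; the normalization function is a hypothetical sketch of one way to get case, dash, camelCase, and underscore insensitivity, and the canonical form omk actually stores is not stated here.

```typescript
type Difficulty = 'easy' | 'medium' | 'hard';
type Provenance = 'human' | 'llm-generated' | 'production-trace';

// Documentation / diagnostics only: none of these fields feed grading or verdicts.
interface SampleMetadata {
  capability?: string[];
  difficulty?: Difficulty;
  construct?: string; // suggested: 'necessity' | 'quality' | 'capability', custom values allowed
  provenance?: Provenance;
}

// Assumed normalization: drop '-' and '_' and lowercase, so that
// "api-selection", "apiSelection", and "API_SELECTION" compare equal.
function normalizeCapability(raw: string): string {
  return raw.replace(/[-_]/g, '').toLowerCase();
}

// Example metadata for one sample that tests necessity on two capabilities.
const exampleMeta: SampleMetadata = {
  capability: ['api-selection', 'error-diagnosis'],
  difficulty: 'medium',
  construct: 'necessity',
  provenance: 'human',
};
```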
7. Task
A task is one concrete execution unit:
one sample × one artifact × one runtime context
Rules:
- The task layer does not directly represent an experiment conclusion.
- A task is the smallest unit of execution and scoring.
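Because a variant resolves to an artifact plus a runtime context, the task set can be pictured as the cross product of samples and variants. The sketch below assumes hypothetical type shapes (only sample_id and the task.artifact naming appear in this spec) and ignores repeats.

```typescript
interface Sample {
  sample_id: string;
}

// Hypothetical resolved variant: one artifact plus one runtime context (cwd).
interface ResolvedVariant {
  name: string;
  artifact: string;
  cwd: string | null;
}

// One task = one sample x one artifact x one runtime context.
interface Task {
  sampleId: string;
  variantName: string;
  artifact: string;
  cwd: string | null;
}

function expandTasks(samples: Sample[], variants: ResolvedVariant[]): Task[] {
  const tasks: Task[] = [];
  for (const sample of samples) {
    for (const variant of variants) {
      tasks.push({
        sampleId: sample.sample_id,
        variantName: variant.name,
        artifact: variant.artifact,
        cwd: variant.cwd,
      });
    }
  }
  return tasks;
}
```

With 2 samples and 3 variants this yields 6 tasks, each scored on its own; any experiment conclusion is drawn one layer above.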
8. Trace
A trace is the process data produced during one execution, including:
- turns
- tool calls
- timing
- execution metrics like token / cost / cache
Rules:
- A trace belongs to the run result.
- A trace is used to explain differences in agent behavior, not to name the thing being evaluated.
3. Term boundaries
1. baseline means "nothing at all"
The standard meaning of baseline is:
- no explicit artifact injection
- no extra project-level runtime context attached
For most users, baseline can be read directly as "nothing at all".
If you want to isolate project-level runtime context, write it explicitly:
- use the self-describing label project-env as the artifact, and --treatment-cwd /path/to/project for the cwd (or eval.yaml's cwd: field)
Here project-env is just an experiment-grouping label; the real meaning is "empty artifact + a specific runtime context".
2. skill is not the umbrella term
skill is used only when the object really is a skill file, a skill directory, or a skill-style system prompt.
Do not use skill as the umbrella term in these cases:
- comparing several objects of different kinds
- describing the generic CLI variant syntax
- describing future objects like agent teams, workflows, etc.
3. agent is not the umbrella term
agent describes an artifact or run form with agent-style runtime characteristics, e.g.:
- has tool calls
- has multi-turn traces
- depends on the runtime environment
But agent should not replace artifact as the generic term.
4. baseline kind and control role are not the same thing
baseline is one member of the ArtifactKind enum, denoting "empty artifact" (no explicit artifact injected). control is a value of experimentRole, denoting "this variant plays the control role in this experiment".
The two are orthogonal:
- A baseline-kind artifact usually plays the control role, but that's not the definition.
- When comparing two skill-kind artifacts (v1 vs v2), one is explicitly declared control; here the control role has nothing to do with baseline kind.
- Both reports and code should treat experimentRole as the single source of truth for identifying the control group, never inferring it back from artifactKind === 'baseline'.
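A minimal sketch of the "single source of truth" rule: the lookup reads experimentRole and never looks at the kind. The VariantConfig shape is hypothetical, and the exactly-one-control check is an assumption based on --control taking a single expression.

```typescript
type ArtifactKind = 'baseline' | 'skill' | 'prompt' | 'agent' | 'workflow';
type ExperimentRole = 'control' | 'treatment';

// Hypothetical shape for illustration.
interface VariantConfig {
  name: string;
  artifactKind: ArtifactKind;
  experimentRole: ExperimentRole;
}

// Identify the control group from the declared role only.
function findControl(variants: VariantConfig[]): VariantConfig {
  const controls = variants.filter((v) => v.experimentRole === 'control');
  const control = controls[0];
  if (controls.length !== 1 || control === undefined) {
    throw new Error(`expected exactly one control variant, found ${controls.length}`);
  }
  return control;
}

// skill-v1 vs skill-v2: the control is a skill, and no baseline is involved.
const skillComparison: VariantConfig[] = [
  { name: 'skill-v1', artifactKind: 'skill', experimentRole: 'control' },
  { name: 'skill-v2', artifactKind: 'skill', experimentRole: 'treatment' },
];
```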
5. In omk, CI only ever means Confidence Interval
In omk, CI always means Confidence Interval, never Continuous Integration. This rule avoids confusion with the non-statistical "CI".
Rules:
- Continuous-integration internal helpers always use "gate": the omk eval gate path / evaluateLayerGates / gateThreshold / LayerGateResult.
- Confidence-interval contexts always use "CI": bootstrap CI / diff CI / the bootstrapCI field / "95% CI".
- Docs / comments / commit messages mentioning "CI" need no clarification: there is a single meaning, so the reader doesn't need context to disambiguate.
6. Stability = across repeated runs (test-retest), not cross-sample spread
Stability corresponds to test-retest reliability in psychometrics: the consistency of one object's score across repeated runs. omk uses CV (coefficient of variation, an engineering measure of relative dispersion) as the primary metric. CV is not equivalent to test-retest reliability in the strict psychometric sense (typically measured with ICC or Pearson r); it is an engineering approximation from the same family of concepts, not a psychometric reliability measurement.
omk's concrete implementation: --repeat N runs the same (variant × sample) N times, and report.variance.perVariant[v] stores the score series across runs. The primary stability metric is CV = σ / mean (coefficient of variation, a dimensionless relative dispersion), with σ + 95% CI as secondary metrics. The thresholds <5% / 5~15% / >15% are empirical values on the 1-5 score scale, not figures cited from the literature.
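The CV arithmetic can be sketched as follows. This is an illustration, not omk's implementation: the spec does not say whether σ is the sample or population standard deviation (the sample form, n − 1, is assumed here), the band names are hypothetical, and treating exactly 15% as part of the middle band is also an assumption.

```typescript
function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Sample standard deviation (n - 1 denominator); an assumption, see lead-in.
function sampleStdDev(xs: number[]): number {
  const m = mean(xs);
  const squared = xs.reduce((sum, x) => sum + (x - m) * (x - m), 0);
  return Math.sqrt(squared / (xs.length - 1));
}

// CV = sigma / mean. Returns null when it cannot be measured:
// fewer than 2 repeated runs, or a zero mean.
function coefficientOfVariation(scores: number[]): number | null {
  if (scores.length < 2) return null;
  const m = mean(scores);
  if (m === 0) return null;
  return sampleStdDev(scores) / m;
}

// Hypothetical band labels for the <5% / 5~15% / >15% thresholds.
type CvBand = 'under-5' | '5-to-15' | 'over-15';

function cvBand(cv: number): CvBand {
  const pct = cv * 100;
  if (pct < 5) return 'under-5';
  if (pct <= 15) return '5-to-15';
  return 'over-15';
}
```

For example, three repeats scoring 3, 4, and 5 have mean 4 and sample σ of 1, so CV is 0.25 (25%), which falls in the highest-dispersion band.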
What is not stability:
- The cross-sample min~max score range is not stability. The score spread of one variant across multiple samples comes mostly from the samples themselves differing in difficulty (eval-samples usually deliberately cover varied tasks), not from intrinsic variant fluctuation. Calling that range "stability" is a misreading — a reader who sees "100%" would wrongly assume the variant is very stable, when in fact the sample set may just be too narrow.
- Success rate is not stability. Success rate reflects "did the task complete" (execution health); "how much the score jitters across repeats" (measurement stability) is an independent concept. When success rate < 100%, it surfaces as a secondary-area alert, not as the primary stability metric.
UI conventions:
- In the six-dim comparison table, the "stability" column's primary value is CV X.X% when variance data exists; when it doesn't (single-run evaluation / no --repeat), show — plus a secondary-area 需 --repeat ≥ 2 ("requires --repeat ≥ 2"). Honestly state what cannot be measured.
- Industry alignment: Anthropic / OpenAI eval docs, Braintrust, Langfuse, etc. all treat variance across repeated runs as the core stability metric, not cross-sample spread.
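The display rule for the stability column can be sketched as a small formatter. The function and field names are hypothetical; the two display strings are the ones given in the convention above.

```typescript
// Hypothetical cell shape: a primary value plus an optional secondary-area hint.
interface StabilityCell {
  primary: string;
  secondary?: string;
}

// cv is a fraction (0.043 means 4.3%), or null when no variance data exists.
function stabilityCell(cv: number | null): StabilityCell {
  if (cv === null) {
    // Single-run evaluation: say plainly that stability was not measured.
    return { primary: '—', secondary: '需 --repeat ≥ 2' };
  }
  return { primary: `CV ${(cv * 100).toFixed(1)}%` };
}
```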
7. Three scoring layers: fact / behavior / LLM judge
LayeredScores splits the composite into three orthogonal layers, with fields factScore / behaviorScore / judgeScore in order, displayed in the UI as "事实" / "行为" / "LLM 评价" respectively.
| Layer | Field | Source | Nature |
|---|---|---|---|
| Fact | factScore | pass rate of fact assertions (contains / json_schema / fact_check, etc.) | rule-verifiable · objective |
| Behavior | behaviorScore | pass rate of behavior assertions (tools_called / tool_output_contains / turns_max, etc.) | rule-verifiable · objective |
| LLM judge | judgeScore | the LLM judge's subjective rubric-based score (= results.llmScore) | model judge · subjective |
Why "LLM judge" isn't called "quality":
- The composite score = arithmetic mean of the three layers; external messaging uses the base four-dimension framework (quality / cost / efficiency / accuracy), where "quality" refers to the composite-score dimension.
- If judgeScore were also called the "quality layer", a single report would carry both a header "quality 3.85" and a detail "quality layer: 4": two numbers with completely different meanings that the reader couldn't tell apart.
- "LLM judge" makes the source (the LLM judge) explicit and contrasts semantically with the rule-verification of "fact / behavior", so the three layers sit side by side without ambiguity.
- judge as a field name aligns with the existing terms judgeExecutor / judgeModel.
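To make the composite-versus-layer distinction concrete, here is a minimal sketch of the arithmetic mean over the three layers. It assumes all three scores are already on the same scale and all present; how omk handles a missing layer or rescales pass rates is not stated in this spec.

```typescript
// Field names follow this spec; the required (non-optional) typing is an assumption.
interface LayeredScores {
  factScore: number;
  behaviorScore: number;
  judgeScore: number;
}

// composite = arithmetic mean of the three layers.
function compositeScore(scores: LayeredScores): number {
  return (scores.factScore + scores.behaviorScore + scores.judgeScore) / 3;
}

// The header "quality" number is the composite; judgeScore is only one input to it.
const exampleLayers: LayeredScores = { factScore: 4, behaviorScore: 5, judgeScore: 3 };
const exampleComposite = compositeScore(exampleLayers);
```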
Code conventions:
- In user-facing docs, UI labels, and changelogs, refer to this layer as "LLM 评价" (Chinese) / "LLM judge" (English).
- Code fields, types, and enum values uniformly use judge / judgeScore / avgJudgeScore.
- Do not reintroduce qualityScore / avgQualityScore in new code (legacy v0.15 naming, removed in v0.16).
4. External expression conventions
1. Docs
User-facing docs use the following priority:
- top-level umbrella: artifact
- experiment grouping: variant
- experiment role: control / treatment
- runtime environment: runtime context
- concrete object type: skill / agent / workflow
2. Command examples
In command examples:
- Use --control <expr> + --treatment <v1,v2,...> to declare variants by experiment role.
- The variant expression resolves to an artifact and a runtime context.
- Prefer concrete paths or concrete names in example objects; don't use a generic placeholder to stand in for every scenario.
- For complex experiment configs, prefer --config eval.yaml; CLI parameters only carry the simple cases.
3. Reports and acceptance
Reports and acceptance docs should answer, in priority order:
- What artifacts is this comparing?
- What runtime context do they run in?
- Who is control, who is treatment?
- Does the difference come from the artifact itself, or from the runtime context?
5. Internal implementation conventions
1. Types and fields
New code prefers:
- Artifact
- ArtifactKind
- artifacts
- task.artifact
- artifactHashes
- VariantConfig.experimentRole (added field, enum 'control' | 'treatment')
2. No-compatibility strategy
omk is still in its 0-1 phase with a very small user base, so it does not proactively keep historical compatibility layers.
Rules:
- New implementations converge directly on the artifact terminology.
- If old naming would cause long-term ambiguity, delete it outright rather than keeping a compatibility alias.
- Make breaking adjustments now rather than letting backward-compatibility debt snowball.
- From v0.16, --variants was removed outright (no deprecation warning); users migrate to --control / --treatment.
3. Naming principles
- Generic abstraction: artifact
- Concrete subtypes: skill / agent / workflow
- Experiment orchestration: variant
- Experiment role: control / treatment (not baseline / experiment)
- Runtime environment: runtime context / cwd
4. Reserve bare kind for ArtifactKind
In omk's product vocabulary, bare kind is reserved for Artifact.kind (ArtifactKind: baseline / skill / prompt / agent / workflow). baseline means the empty eval artifact; experiment role still comes from control / treatment. CLI design follows the same rule: a future --kind flag should mean artifact kind, not install target, report type, or observe event type.
For other discriminants, use a qualified name when the field is new or safe to rename. Existing persisted kind fields stay as-is unless a dedicated migration changes them:
- report.kind → reportKind / documentKind
- event.kind → eventKind
- executorRuntime.kind → runtimeKind
- standard.kind → standardKind
Two caveats:
- Persisted discriminants are frozen. Any kind already serialized into a report / observe / doctor / diagnosis JSON file is a stored field name: renaming it would break deserializing existing on-disk files, so it needs a dedicated data / schema migration (not done here). This is serialization back-compat, not statistical comparability; renaming the field changes no measurement number. (report.kind additionally sits in the Report-schema invariant list, so treat any change there with the usual schema care.)
- Renaming internal non-persisted fields is progressive: done opportunistically when touching that code, not as a big-bang sweep. A continuous-integration guard freezes the current set of bare-kind declaration sites so new unqualified ones cannot slip in.
6. Term mapping
| Old term | New standard term | Note |
|---|---|---|
| evaluand | artifact | unified umbrella for the thing being evaluated |
| EvaluandSpec | Artifact | core object type |
| EvaluandKind | ArtifactKind | object category |
| evaluands | artifacts | object list in the request |
| task.evaluand | task.artifact | the object a single task binds to |
| evaluandHashes | artifactHashes | content hash of the artifact |
| skillHashes | artifactHashes | unified object hash in the report |
| skill as the umbrella | artifact | skill falls back to a concrete subtype |
| agent as the umbrella | artifact / agent runtime | choose by semantics |
| --variants CLI parameter | --control / --treatment | declare variants by experiment role; the flat list is gone |
| inferring the control group from artifactKind === 'baseline' | read experimentRole === 'control' explicitly | the control group is user-declared, not inferred from artifact kind |
| LayeredScores.qualityScore | LayeredScores.judgeScore | displayed as "LLM 评价" / "LLM judge"; avoids clashing with the "quality" header (composite) |
| VariantSummary.avgQualityScore | VariantSummary.avgJudgeScore | same as above |
| VarianceLayerKey: 'quality' | VarianceLayerKey: 'judge' | same as above |
7. Skill isolation (added in v0.22)
1. Problem background
When omk runs baseline-vs-skill evaluations, the baseline variant by default reaches every skill under ~/.claude/skills/ through three channels, so the baseline is not actually a "bare model". This is a construct-validity problem:
- SDK skill auto-discovery: the Claude Agent SDK scans ~/.claude/skills/ by default and injects the skill list into the main session's system prompt.
- subagent Skill tool: even if the main session has no skills, the SDK's built-in task subagent can still load skill content on demand by calling Skill(...).
- cwd file-system access: the baseline's default cwd is the user's evaluation working directory, which usually has a skills/<name>/ symlink prepared for the treatment; the baseline can follow that symlink with plain Glob / Read tools and read SKILL.md directly.
Only once all three channels are blocked is the baseline truly "bare". If any one is left open, the baseline routes around the others to reach skill content, and the verdict / Δ reflects a contaminated baseline vs. treatment rather than the real "no knowledge vs. knowledge".
2. Terminology
- allowedSkills (per-variant field, added to Artifact / VariantConfig / EvalConfigVariant):
  - undefined → default SDK behavior (full discovery of ~/.claude/skills/)
  - [] → full isolation: options.skills = [] + options.disallowedTools = ['Skill']; the main session discovers no skills, and the subagent can't call the Skill tool either
  - [name1, name2] → whitelist: options.skills = [name1, name2], loading only the named skills. The subagent goes through a separate channel; in the whitelist case v1 does not force the subagent to follow.
- --strict-baseline flag (default true): automatically sets allowedSkills = [] for every kind === 'baseline' artifact; --no-strict-baseline turns it off (explicit opt-out).
- meta.skillIsolation (new report-meta field): a variantName → allowedSkills snapshot, used to validate comparability when comparing verdict / Δ across reports.
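The three allowedSkills states map onto executor options roughly as follows. This sketch only mirrors the mapping described in this spec for the SDK path; the function name and the SkillOptions type are hypothetical, and the option names are taken from the text above rather than verified against the SDK.

```typescript
// Hypothetical subset of the options object described in this spec.
interface SkillOptions {
  skills?: string[];
  disallowedTools?: string[];
}

function sdkSkillOptions(allowedSkills: string[] | undefined): SkillOptions {
  if (allowedSkills === undefined) {
    // Default SDK behavior: full discovery, nothing overridden.
    return {};
  }
  if (allowedSkills.length === 0) {
    // Full isolation: no skills in the main session, Skill tool blocked for subagents.
    return { skills: [], disallowedTools: ['Skill'] };
  }
  // Whitelist: only the named skills; v1 does not force the subagent to follow.
  return { skills: [...allowedSkills] };
}
```

Note that undefined and [] are deliberately different states: the first leaves discovery on, the second turns it off entirely.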
3. Defaults and priority
eval.yaml variant.allowedSkills (explicit)
> CLI --strict-baseline / --no-strict-baseline (batch)
> default (strictBaseline = true)

baseline-kind defaults to [] (strict); other kinds default to undefined (full SDK discovery).
4. Isolation coverage
| Channel | Covered? | Mechanism |
|---|---|---|
| Main session skills | ✅ | options.skills = [] |
| SDK built-in task subagent calling the Skill tool | ✅ (when allowedSkills=[]) | options.disallowedTools = ['Skill'] |
| cwd file system (baseline → cwd → skills/ symlink → SKILL.md) | ✅ (strict + user gave no explicit cwd) | baseline cwd switched to the empty dir ~/.oh-my-knowledge/isolated-cwd/ |
| MCP servers | ✅ (blocked by default) | SDK settingSources defaults to [], omk passes no mcpServers |
| AgentDefinition.skills whitelist fine-grained control | ❌ (known hole, not in v1) | follow-up: omk adds an agents option |
| script executor | ❌ | stderr warn; user-custom, doesn't participate in isolation |
Why the cwd channel is listed separately: after blocking only the two SDK channels (skills:[] + disallowedTools:['Skill']), the baseline's Skill tool calls do drop to 0, but the baseline can still use plain Glob / Read to follow the skills/<name>/ symlink under cwd and read SKILL.md, completely bypassing the SDK isolation. Root cause: omk defaults to baseline.cwd === null → the SDK falls back to process.cwd() = the user's evaluation working directory, which usually has a skills/<name>/ symlink prepared for the treatment. The fix is to switch the baseline's default cwd to ~/.oh-my-knowledge/isolated-cwd/ (an empty dir). When the user explicitly sets a cwd for the baseline, this is left untouched (explicit cwd = the user is responsible for keeping that dir clean).
Note: isolated-cwd is not a sandbox — the baseline can still Read any absolute path. But the model won't proactively guess the user's private paths (no system-prompt hint). If the evaluation scenario prompts the baseline to read an absolute path, an additional sandbox layer is needed (out of scope).
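The priority order from "Defaults and priority" and the cwd rule described above can be combined into one resolution step, sketched below. All names are hypothetical, the isolated directory is passed in rather than computed, and treating the cwd switch as depending only on kind and the strict flag is an assumption about behavior this spec does not fully pin down.

```typescript
type ArtifactKind = 'baseline' | 'skill' | 'prompt' | 'agent' | 'workflow';

// Hypothetical per-variant input: explicit values come from eval.yaml or CLI cwd flags.
interface VariantInput {
  kind: ArtifactKind;
  allowedSkills?: string[];
  cwd?: string;
}

interface ResolvedIsolation {
  allowedSkills: string[] | undefined; // undefined = full SDK discovery
  cwd: string | null;                  // null = SDK falls back to process.cwd()
}

function resolveIsolation(
  variant: VariantInput,
  strictBaseline: boolean,
  isolatedDir: string,
): ResolvedIsolation {
  const strict = variant.kind === 'baseline' && strictBaseline;
  // Explicit per-variant allowedSkills wins over the batch flag and the default.
  const allowedSkills =
    variant.allowedSkills !== undefined ? variant.allowedSkills : strict ? [] : undefined;
  // An explicit cwd is left untouched; the user keeps that directory clean.
  const cwd = variant.cwd !== undefined ? variant.cwd : strict ? isolatedDir : null;
  return { allowedSkills, cwd };
}
```

A caller would pass the expanded form of ~/.oh-my-knowledge/isolated-cwd/ as isolatedDir; as noted above, that directory is an empty working directory, not a sandbox.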
5. Cache key version
The cache key currently carries a v4: prefix, with allowedSkills, the executor name, and the executor runtime fingerprint folded into the key — switching strict / non-strict, crossing executors, or a binary / SDK version change will not falsely hit stale output.
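One way to picture how these inputs fold into the key is the sketch below. It is illustrative only: the field names, the separator, the sorting of skill names, and the absence of hashing are all assumptions; only the v4: prefix and the three folded-in inputs come from this spec.

```typescript
// Hypothetical inputs; the real key likely includes more (prompt, model, and so on).
interface CacheKeyParts {
  sampleId: string;
  variantName: string;
  allowedSkills: string[] | undefined;
  executor: string;
  runtimeFingerprint: string; // e.g. a binary / SDK version string
}

function cacheKey(parts: CacheKeyParts): string {
  // Keep undefined (full discovery) distinct from [] (full isolation).
  const skills =
    parts.allowedSkills === undefined
      ? 'default'
      : JSON.stringify([...parts.allowedSkills].sort());
  return [
    'v4',
    parts.executor,
    parts.runtimeFingerprint,
    skills,
    parts.variantName,
    parts.sampleId,
  ].join(':');
}
```

Under this sketch, toggling strict mode changes the skills segment and switching executors changes the executor segment, so neither can reuse output cached under the other configuration.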
6. Executor compatibility
| Executor | undefined | [] | [name] |
|---|---|---|---|
| claude-sdk | full discovery (default) | skills:[] + disallowedTools:[Skill] | skills:[name] |
| claude-cli | default | --disable-slash-commands --disallowedTools Skill | throw (user switches to sdk) |
| script | default | stderr warn, non-blocking (no effect) | stderr warn, non-blocking (no effect) |
The claude-cli executor uses a double block, --disable-slash-commands (docs: "Disable all skills") + --disallowedTools Skill, equivalent to the SDK — it just lacks partial-whitelist capability, so a whitelist [name] requirement must go through claude-sdk (the SDK's skills option directly supports whitelist semantics). The script executor is user-custom and can't be guaranteed to honor isolation, so it only warns.
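The compatibility table can be read as a single dispatch on executor and allowedSkills state. The sketch below encodes the table's cells; the IsolationPlan type, the function name, and the message wording are hypothetical, and the flag and option names are copied from this spec rather than checked against the tools.

```typescript
type Executor = 'claude-sdk' | 'claude-cli' | 'script';

// Hypothetical result type: exactly one of these is set per executor, or none for defaults.
interface IsolationPlan {
  sdkOptions?: { skills: string[]; disallowedTools?: string[] };
  cliArgs?: string[];
  warning?: string;
}

function planIsolation(executor: Executor, allowedSkills: string[] | undefined): IsolationPlan {
  if (allowedSkills === undefined) return {}; // default behavior for every executor
  const fullIsolation = allowedSkills.length === 0;
  switch (executor) {
    case 'claude-sdk':
      return fullIsolation
        ? { sdkOptions: { skills: [], disallowedTools: ['Skill'] } }
        : { sdkOptions: { skills: [...allowedSkills] } };
    case 'claude-cli':
      if (!fullIsolation) {
        // No partial-whitelist capability on this path.
        throw new Error('claude-cli cannot whitelist skills; switch to claude-sdk');
      }
      return { cliArgs: ['--disable-slash-commands', '--disallowedTools', 'Skill'] };
    case 'script':
      // User-custom executor: warn on stderr, do not block the run.
      return { warning: 'script executor does not participate in skill isolation' };
  }
}
```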
8. Decision criteria
When adding features, docs, or interfaces later and facing a naming choice, decide in this order:
- Is it describing the thing being evaluated? If so, use artifact.
- Is it describing the experiment grouping? If so, use variant.
- Is it describing the experiment role? If so, use control / treatment.
- Is it describing the run directory or environment? If so, use runtime context.
- Is it describing a concrete object type? If so, use skill / agent / workflow.
- If a single word mixes object, environment, or role semantics, split it apart and rewrite.