Statistical rigor

omk's job is to answer "the knowledge you give your LLM — what's it actually worth?" with objective data. The biggest LLM-eval failure mode is confident bias — narrow CIs around the wrong answer. omk ships four pieces — bootstrap CI, length-debias and saturation on by default, plus Krippendorff α the moment you add a gold set — so conclusions can be externally audited.

This page is the depth reference. The README hero callout is the entry; come here for formulas, flags, and the why.

1. Bootstrap CI (`--bootstrap`)

Distribution-free confidence intervals.

The t-test breaks on ordinal LLM scores (Likert-like buckets, not normal-distributed continuous values). Bootstrap directly resamples the raw observations and stays valid at small N (< 30) and on skewed distributions.

Mean CI per variant — resampled with replacement N times (default 1000)
Pairwise diff CI between two variants — diff of resampled means; if the CI does not cross 0, the difference is significant at the chosen α (default 0.05 → 95% CI)
Output: each VariantResult.bootstrapCI carries [lo, hi] for mean and pairwise; HTML report draws CI bands; CLI omk eval consumes them in the 6-tier verdict logic
Reproducible by default: CIs use a fixed internal seed, so the same eval run twice yields byte-identical CIs and a stable verdict. A non-deterministic CI could flip a comparison between significant and not significant near the boundary, and silently flip the ship/no-ship call with it

Reference: Efron & Tibshirani (1993), "An Introduction to the Bootstrap". omk implementation: src/eval-core/bootstrap.ts — the formula is covered by test/eval-core/bootstrap.test.ts, and the documented defaults (resample count, α) are kept in sync with the code constants by test/scripts/doc-constants-drift.test.ts.
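The resampling described above can be sketched as follows. This is a minimal illustration, not the code in src/eval-core/bootstrap.ts: it assumes a percentile interval (omk may use a different interval type), and the PRNG, the seed value, and every function name here are hypothetical.

```typescript
// Seeded PRNG (mulberry32). The PRNG and the seed are illustrative assumptions;
// the text above only says that omk uses a fixed internal seed.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Mean of one resample drawn with replacement from xs.
function resampleMean(xs: number[], rand: () => number): number {
  let sum = 0;
  for (let i = 0; i < xs.length; i++) {
    sum += xs[Math.floor(rand() * xs.length)];
  }
  return sum / xs.length;
}

// Percentile interval [lo, hi] taken from a list of bootstrap statistics.
function percentileInterval(stats: number[], alpha: number): [number, number] {
  const sorted = stats.slice().sort((x, y) => x - y);
  const at = (p: number): number =>
    sorted[Math.min(sorted.length - 1, Math.floor(p * sorted.length))];
  return [at(alpha / 2), at(1 - alpha / 2)];
}

// CI for the mean of one variant (defaults mirror the documented 1000 / 0.05).
function bootstrapMeanCI(
  xs: number[], resamples = 1000, alpha = 0.05, seed = 1
): [number, number] {
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let b = 0; b < resamples; b++) means.push(resampleMean(xs, rand));
  return percentileInterval(means, alpha);
}

// CI for (treatment mean - control mean): the diff of resampled means.
function bootstrapDiffCI(
  control: number[], treatment: number[],
  resamples = 1000, alpha = 0.05, seed = 1
): [number, number] {
  const rand = mulberry32(seed);
  const diffs: number[] = [];
  for (let b = 0; b < resamples; b++) {
    diffs.push(resampleMean(treatment, rand) - resampleMean(control, rand));
  }
  return percentileInterval(diffs, alpha);
}

// Usage with made-up 1-5 judge scores for two variants.
const baselineScores = [2, 3, 3, 2, 4, 3, 2, 3];
const treatmentScores = [4, 4, 5, 3, 4, 5, 4, 4];
const diffCI = bootstrapDiffCI(baselineScores, treatmentScores);
const isSignificant = diffCI[0] > 0 || diffCI[1] < 0; // CI does not cross 0
```

Because the generator is re-seeded on every call, repeating the call on the same scores returns the same interval, which is the property the reproducibility note relies on.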

2. Human Gold + Krippendorff α (`--gold-dir`)

Judge ↔ human agreement, anchored externally.

CI tells you "is the judge stable across resamples". α tells you "is the judge agreeing with a human standard". Two complementary axes:

Stable + low α = judge is consistently wrong (systematic bias)
Unstable + high α = judge agrees with humans on average but is noisy
Stable + high α = trust the judge for this rubric
Unstable + low α = judge is broken

omk auto-detects gold-judge collusion: if the gold annotator is the same model as the judge (e.g., both claude-3.5-sonnet), α inflates because both share the same biases. omk warns and treats that α as an upper-bound calibration signal, not as an adjusted score.

The same logic applies on the judge-vs-output axis: if the judge is the same model family as the executor that produced the outputs — the default, where claude:haiku judges claude:sonnet output — the judge's self-preference inflates scores. omk flags this (judge_self_preference, plus single_vendor_ensemble when a multi-judge panel is all one vendor) and points to the fix: a cross-vendor judge (--judge-models openai-api:gpt-4o) or gold calibration. Because omk holds the model fixed across baseline and treatment, self-preference largely cancels in the A/B delta — it bites absolute scores, version-regression curves, and cross-model comparisons, which is what the warning scopes itself to.

Today gold is reported as calibration evidence in the report and CLI output. It does not by itself change the headline verdict; use it to decide whether the judge is trustworthy enough for the decision context.

Formula: standard Krippendorff α with interval distance metric (δ²=(c−k)²; a defensible choice for 1-5 Likert). Implementation: src/grading/human-gold.ts. Inputs: a gold dataset directory with metadata.yaml and one or more annotation YAML files containing annotations: [{ sample_id, score, reason? }].
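For intuition, α = 1 − D_o / D_e, where D_o is the observed disagreement within each sample and D_e is the disagreement expected if scores were paired at random. The sketch below works this out for the simplest case: two raters (judge and human), every sample scored by both, interval distance δ² = (c − k)². It is an illustration under those assumptions, not the code in src/grading/human-gold.ts; the function name is hypothetical and missing annotations are not handled.

```typescript
// Krippendorff's alpha with the interval metric, restricted to two raters
// with complete data. judge[i] and human[i] are the scores for sample i.
function krippendorffAlphaInterval(judge: number[], human: number[]): number {
  if (judge.length === 0 || judge.length !== human.length) {
    throw new Error("need two non-empty score lists of equal length");
  }
  const units = judge.length;

  // Observed disagreement: mean squared difference within each sample.
  let observed = 0;
  for (let u = 0; u < units; u++) {
    const d = judge[u] - human[u];
    observed += d * d;
  }
  observed /= units;

  // Expected disagreement: mean squared difference over all pairs of
  // pooled values, ignoring which sample or rater they came from.
  const pooled = judge.concat(human);
  const n = pooled.length;
  let pairSum = 0;
  for (let i = 0; i < n; i++) {
    for (let j = i + 1; j < n; j++) {
      const d = pooled[i] - pooled[j];
      pairSum += d * d;
    }
  }
  const expected = (2 * pairSum) / (n * (n - 1));

  // With no variation in the scores at all, alpha is undefined.
  if (expected === 0) return NaN;
  return 1 - observed / expected;
}

// Usage with made-up scores: identical ratings give alpha = 1,
// systematically swapped ratings give a negative alpha.
const perfect = krippendorffAlphaInterval([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]);
const swapped = krippendorffAlphaInterval([1, 2], [2, 1]);
```

Values near 1 mean the judge tracks the human standard; values near 0 mean agreement no better than chance; negative values indicate systematic disagreement.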

3. Debiased judge prompt: length / presentation / tone (default ON)

Research shows LLM judges over-weight verbosity, polished formatting, and confident tone: longer, prettier, more assertive answers score higher independent of quality (format / markdown bias; sycophancy / authority bias). omk's judge prompt explicitly states both "length is not a quality signal" and "presentation and tone are not quality signals", and it asks for chain-of-thought reasoning against the rubric. The wording is deliberately symmetric (it neither rewards polish/confidence nor penalizes plainness/hedging), so the debias instruction does not over-correct into a reverse bias.

Report metadata records the judge prompt hash (template version); changing any debias instruction changes the hash, and reports with different hashes are never compared blind
Presentation/tone neutrality is always on with no toggle; length-debias can be opted out via --no-debias-length for research / replication — for a dedicated length-bias audit, compare reports with and without that flag
These are "research says judges generally have this bias" prompt instructions; omk does not run a before/after bias measurement on its own judge. The real channel to validate that debiasing works is gold calibration (Krippendorff α vs human)
Reference: Saito et al. (2023), "Verbosity Bias in Preference Labeling by Large Language Models"

Frozen by: test/grading/judge-hash-frozen.test.ts — byte-level hash freeze prevents silent prompt drift across versions.

4. Saturation curve

Answers "have I run enough samples?"

With --repeat ≥ 5, omk recomputes the bootstrap CI at each cumulative N and tracks how quickly its width shrinks. When the relative shrink stays under 5% for 3 consecutive measurements, the eval is saturated: more samples would not tighten the CI meaningfully, so the additional cost is wasted.

HTML report inlines an SVG saturation curve + verdict label
omk eval uses saturation as one input to the 6-tier verdict logic
Default window size: 3 consecutive measurements; threshold: 5% relative shrink in CI width
Reference: this is omk's own design, not a published method. Implementation: src/eval-core/saturation.ts
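One way to read the saturation rule is sketched below. It is not the code in src/eval-core/saturation.ts: the function name is hypothetical, and two details are assumptions, namely that the check looks only at the most recent measurements and that a CI that widens by 5% or more also counts as not yet saturated.

```typescript
// ciWidths[i] is the bootstrap CI width (hi - lo) after the i-th cumulative
// batch of samples. Returns true when the last `windows` consecutive steps
// each changed the width by less than `threshold` (relative to the previous).
function isSaturated(
  ciWidths: number[], windows = 3, threshold = 0.05
): boolean {
  // Need `windows` steps, which takes `windows + 1` measurements.
  if (ciWidths.length < windows + 1) return false;
  const recent = ciWidths.slice(-(windows + 1));
  for (let i = 1; i < recent.length; i++) {
    const prev = recent[i - 1];
    const curr = recent[i];
    // A zero-width CI cannot shrink further; any growth from zero is unstable.
    const change = prev === 0
      ? (curr === 0 ? 0 : Infinity)
      : Math.abs(prev - curr) / prev;
    if (change >= threshold) return false;
  }
  return true;
}

// Usage with made-up widths: the first series flattens out, the second is
// still shrinking fast, the third is too short to judge.
const flat = isSaturated([1.0, 0.7, 0.5, 0.49, 0.485, 0.48]);
const stillShrinking = isSaturated([1.0, 0.7, 0.5, 0.4]);
const tooShort = isSaturated([0.5, 0.49]);
```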

Why these four together

Each piece guards a different failure mode:

| Failure mode | Guard |
| --- | --- |
| "v2 looks better but it's within margin of error" | Bootstrap CI (pairwise diff CI not crossing 0) |
| "Judge says v2 is better but I don't trust the judge" | Krippendorff α (judge ↔ human) |
| "Judge favors verbose / polished / confident answers" | Length / presentation / tone debiased judge prompt |
| "I ran 10 samples and stopped — was that enough?" | Saturation curve |

Skip any one and you have a hole. Bootstrap CI, length-debias and saturation are on by default; of the three, only length-debias can be opted out of (--no-debias-length, for research replication), and the other two are unconditional. Krippendorff α turns on automatically once you supply a gold set (--gold-dir).

Verdict robustness — multiple comparisons + stability gate

The 6-tier verdict aggregates the pieces above into one ship/no-ship call. Two corrections keep that call honest when the design stresses it:

Multiple-comparison correction (Bonferroni). With K treatments compared against one control, testing each pairwise diff at α independently inflates the family-wise false-positive rate — the worst-case roll-up takes the loudest pair, so any single spurious "significant" pulls the headline. omk tests each comparison at α / K instead, holding the family-wise error at the nominal α. K = 1 (classic A/B) is unchanged. Each corrected VariantPairComparison records its effective alpha, and the report relabels the CI accordingly — a Bonferroni-widened interval is never shown as "95%".
Stability gate. A statistically significant gain that does not reproduce across runs is not shippable. When stability is actually measured (--repeat ≥ 2) and run-to-run variation is high (median CV > 15%), a PROGRESS verdict is downgraded to CAUTIOUS, with the instability surfaced in the headline. Single-run reports are not gated — stability is simply unmeasured there (the rationale says so), and auto-downgrading every single-run eval would be over-aggressive.

Implementation: src/eval-core/verdict.ts and src/eval-core/evaluation-reporting.ts. The CV threshold is kept in sync with the code by test/scripts/doc-constants-drift.test.ts.
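The two corrections reduce to a few lines of arithmetic. The sketch below is an illustration, not the logic in src/eval-core/verdict.ts: the function names are hypothetical, the verdict is modeled as a plain string, and the CVs are assumed to arrive as one coefficient of variation per measured unit.

```typescript
// Bonferroni: with K treatments against one control, test each pair at alpha / K.
function effectiveAlpha(alpha: number, k: number): number {
  return alpha / Math.max(1, k);
}

// Confidence level implied by an effective alpha, e.g. 0.05 -> 0.95.
// A corrected interval is wider than 95%, so it must not be labeled "95%".
function confidenceLevel(effAlpha: number): number {
  return 1 - effAlpha;
}

function median(xs: number[]): number {
  const s = xs.slice().sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 === 1 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Stability gate: only applies when stability was measured (repeat >= 2),
// and only downgrades PROGRESS to CAUTIOUS when the median CV exceeds 15%.
function applyStabilityGate(
  verdict: string, repeat: number, cvs: number[], cvThreshold = 0.15
): string {
  const measured = repeat >= 2 && cvs.length > 0;
  if (measured && verdict === "PROGRESS" && median(cvs) > cvThreshold) {
    return "CAUTIOUS";
  }
  return verdict;
}

// Usage with made-up numbers.
const abAlpha = effectiveAlpha(0.05, 1);      // classic A/B, unchanged
const fiveWayAlpha = effectiveAlpha(0.05, 5); // five treatments vs one control
const noisy = applyStabilityGate("PROGRESS", 3, [0.1, 0.2, 0.3]);
const singleRun = applyStabilityGate("PROGRESS", 1, [0.5]);
const steady = applyStabilityGate("PROGRESS", 3, [0.05, 0.1, 0.12]);
```

In the single-run case the verdict passes through unchanged, matching the note that unmeasured stability is reported rather than penalized.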

Construct-validity isolation (`--strict-baseline`, default ON)

A fifth invariant, separate from the four above but also default-on:

The baseline gets the prompt without the skill being tested. omk cuts three contamination paths so the baseline doesn't silently see the skill it's compared against:

SDK skill auto-discovery
subagent Skill tool
cwd file-system access via the skills/<name>/ symlink

eval.yaml allowedSkills: [] can force strict isolation on any variant. Without isolation, any "v2 is better than baseline" claim is suspect because baseline may have been reading v2's own SKILL.md through one of the three channels.

See: docs/specs/sample-design-spec.md for related sample-design considerations.

Reproducibility / audit trail

Every report carries:

omk version (reportMeta.cliVersion)
Node version (reportMeta.nodeVersion)
Judge models + prompt hash (reportMeta.judgeModels — each entry carries its judge model and runtime fingerprint — and reportMeta.judgePromptHash)
Executor runtime fingerprint (reportMeta.executorRuntime)
Sample fingerprints (reportMeta.sampleHashes)
Skill isolation snapshot (reportMeta.skillIsolation)
Schema version (reportMeta.schemaVersion)

Cross-version comparability is enforced by BREAKING-COMPARABILITY callouts in GitHub Releases — when a measurement invariant changes, you'll see it.
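A consumer of two reports might use this metadata as sketched below. The field names come from the list above, but the field types, the shape of the judgeModels entries, and the comparabilityIssues helper are assumptions made for illustration. Only the judge prompt hash check is stated in this page; the schema and sample-set checks are extra guesses at what a careful comparison would want.

```typescript
// Field names follow the reportMeta list; all types here are assumed.
interface ReportMeta {
  cliVersion: string;
  nodeVersion: string;
  judgeModels: { model: string; runtime: string }[]; // entry shape assumed
  judgePromptHash: string;
  executorRuntime: string;
  sampleHashes: string[];
  skillIsolation: unknown;
  schemaVersion: string;
}

// Returns the reasons two reports should not be compared blind.
// An empty list means none of the checked invariants differ.
function comparabilityIssues(a: ReportMeta, b: ReportMeta): string[] {
  const issues: string[] = [];
  if (a.judgePromptHash !== b.judgePromptHash) {
    issues.push("judge prompt hash differs");
  }
  if (a.schemaVersion !== b.schemaVersion) {
    issues.push("schema version differs");
  }
  const sortedA = a.sampleHashes.slice().sort().join(",");
  const sortedB = b.sampleHashes.slice().sort().join(",");
  if (sortedA !== sortedB) {
    issues.push("sample set differs");
  }
  return issues;
}

// Usage with made-up metadata values.
const reportA: ReportMeta = {
  cliVersion: "0.0.0",
  nodeVersion: "v20.0.0",
  judgeModels: [{ model: "claude:haiku", runtime: "example-runtime" }],
  judgePromptHash: "abc123",
  executorRuntime: "example-runtime",
  sampleHashes: ["s1", "s2"],
  skillIsolation: null,
  schemaVersion: "1",
};
const reportB: ReportMeta = { ...reportA, judgePromptHash: "def456" };
const issuesFound = comparabilityIssues(reportA, reportB);
```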
