Sample design guide
For omk users: how to declare measurement metadata on a sample, write sandbox fields, and self-check before running an eval. The academic alignment behind it (HELM / IRT / Construct Validity / contamination defense, etc.) and the schema-extension decisions are laid out in the appendix (§6) — we put the full rationale on the table rather than tucking it away in an internal note.
1. Why sample design needs to be rigorous
omk's statistical-rigor stack (Bootstrap CI / Krippendorff α / length-debias / saturation curves / verdict) answers "is the conclusion computed correctly". But the conclusion is built on the sample set — if the samples themselves aren't rigorous, all the downstream statistical rigor is hollow.
The most common construct mismatch: you run baseline-vs-skill intending to measure "is the skill well written" (quality), but what the sample set actually measures is "baseline doesn't know some domain knowledge vs the skill provides it" (necessity). Both produce equally impressive verdict numbers, but they answer different questions — and without sample metadata declaring the construct assumption, that mismatch is invisible at the verdict output layer.
2. Sample metadata schema
# eval-samples.yaml
samples:
  - sample_id: s001
    prompt: "Draw a line chart in React; data is date + value, give minimal runnable code"
    rubric: "Must identify the Line component + correct data format + include a chart render container"
    assertions:
      - { type: contains, value: "Line", weight: 1 }
      - { type: regex, pattern: "data", weight: 1 }
    # 4 optional metadata fields (docs/diagnostics only, never enter grading)
    capability:
      - component-recognition   # string[], capability dimensions, multiple allowed; normalized case/dash/camelCase-insensitive
      - api-selection
    difficulty: easy            # 'easy' | 'medium' | 'hard' (strict enum, typo-proof)
    construct: necessity        # 'necessity' | 'quality' | 'capability' suggested, custom string allowed
    provenance: human           # 'human' | 'llm-generated' | 'production-trace'
Field semantics
- capability (string[]): the capability dimensions this sample covers. Declare them from a capability-matrix perspective, so you can see "I cover component-recognition × 8 samples / api-selection × 6 samples / fallback × 2 samples, fallback is thin". Normalization rule: case-insensitive, plus dash / camelCase / underscore / space folding, so api-selection / apiSelection / API_Selection / api selection all count as the same capability.
- difficulty (enum): a simple bucketing (easy / medium / hard). A typo like difficulty: 'easy?' is rejected by loadSamples with an error that names the sample_id.
- construct (string): which kind of thing this sample measures. Distinct from capability: capability is "which concrete ability is tested" (api-selection), construct is "which construct type is tested". Three suggested values:
  - necessity: baseline-vs-skill, measures whether the skill is necessary at all. A large Δ doesn't necessarily mean the skill is well written; it may simply be that baseline doesn't know the domain knowledge (a self-evident conclusion).
  - quality: skill-v1 vs skill-v2, measures which phrasing of the same knowledge lets the model answer more accurately. This is where omk's measurement rigor truly earns its keep.
  - capability: measures the difference along one concrete capability dimension.
  Custom strings are allowed (e.g. regression-test / cost-efficiency); the studio won't error on a custom value.
- provenance (enum): data source. human (hand-curated) / llm-generated (auto-injected by omk sample) / production-trace (sampled from production traces, which you import yourself).
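The capability folding and the strict difficulty enum described above can be sketched as follows. This is a minimal illustration, not omk's loader: the function names `normalizeCapability` and `checkDifficulty` and the error wording are assumptions. The folded form (lowercase, separators removed) is chosen to agree with names such as componentrecognition that appear in the coverage output in §3.

```typescript
// Sketch only: function names and error wording are assumptions, not omk's real code.
type Difficulty = "easy" | "medium" | "hard";
const DIFFICULTIES: readonly string[] = ["easy", "medium", "hard"];

/** Fold case, dashes, underscores and spaces so spelling variants collide. */
function normalizeCapability(raw: string): string {
  return raw.toLowerCase().replace(/[-_\s]+/g, "");
}

/** Throw on an unknown difficulty, naming the offending sample. */
function checkDifficulty(sampleId: string, value: unknown): Difficulty | undefined {
  if (value === undefined) return undefined; // the field is optional
  if (typeof value === "string" && DIFFICULTIES.includes(value)) {
    return value as Difficulty;
  }
  throw new Error(`sample ${sampleId}: invalid difficulty ${JSON.stringify(value)}`);
}

// The four spellings from the text collapse into one bucket.
const variants = ["api-selection", "apiSelection", "API_Selection", "api selection"];
const folded = new Set(variants.map(normalizeCapability)); // one entry: "apiselection"
```

Folding before counting is what keeps the coverage buckets honest: without it, apiSelection and api-selection would each look like a thin capability.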
Never enters grading / judge / verdict
These 4 fields are used only for:
- the studio coverage block, plus the rubric_clarity_low / capability_thin issue detectors
- the report.analysis.sampleQuality aggregate (for tools to read)
They never enter the judge prompt (buildJudgePrompt(prompt, rubric, output, traceSummary) has no sample object in its signature, and test/grading/judge-prompt-isolation.test.ts guards against regressions). They never affect the verdict algorithm. This is a hard requirement for construct-validity protection — a judge seeing "construct: necessity" is a judge that knows the answer key.
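One way to picture the isolation guard is a check that plants sentinel metadata values and confirms none of them reaches the judge prompt. The sketch below is an assumption-laden illustration: the body of `buildJudgePrompt` is an invented stand-in that only reuses the four-argument signature quoted above, and `leakedMetadata` is a hypothetical helper, not the content of the real test file.

```typescript
// Stand-in with the four-argument signature quoted above; the body is invented.
function buildJudgePrompt(prompt: string, rubric: string, output: string, traceSummary: string): string {
  return [`Task: ${prompt}`, `Rubric: ${rubric}`, `Output: ${output}`, `Trace: ${traceSummary}`].join("\n");
}

interface SampleLike {
  prompt: string;
  rubric: string;
  capability?: string[];
  difficulty?: string;
  construct?: string;
  provenance?: string;
}

/** Return the metadata values found inside the judge prompt (ideally none). */
function leakedMetadata(sample: SampleLike, judgePrompt: string): string[] {
  const values = [
    ...(sample.capability ?? []),
    sample.difficulty,
    sample.construct,
    sample.provenance,
  ].filter((v): v is string => typeof v === "string");
  return values.filter((v) => judgePrompt.includes(v));
}

// Sentinel values make an accidental leak unmistakable.
const probe: SampleLike = {
  prompt: "Draw a line chart in React",
  rubric: "Must identify the Line component",
  capability: ["SENTINEL_CAP"],
  difficulty: "SENTINEL_DIFF",
  construct: "SENTINEL_CONSTRUCT",
  provenance: "SENTINEL_PROV",
};
const judgePrompt = buildJudgePrompt(probe.prompt, probe.rubric, "<model output>", "<trace summary>");
const leaks = leakedMetadata(probe, judgePrompt); // empty for this stand-in
```

Because the builder never receives the sample object, a leak can only appear if someone widens the signature, which is exactly the change a regression test of this kind is meant to catch.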
Sandbox eval fields (mocks / environment / tripwire / mocksStrict)
To run evals decoupled from the real external environment (databases / APIs / filesystem / actual git push, etc.), a sample also carries a group of sandbox fields. The omk runtime matches mocks before a tool call; on a hit it returns fake data instead of really invoking the underlying tool.
- sample_id: s002
  prompt: "Use antlogs-query to count ERROR logs in the last 1 hour"
  rubric: "Must call the logstore_query tool, filter containing 'ERROR', time window 1 hour"
  assertions:
    - { type: tool_input_contains, value: "Bash:logstore_query", weight: 1 }
    - { type: mock_hit, value: "Bash:1", weight: 1 }
  mocksStrict: true    # default true (generator-enforced); an unmatched tool call is denied outright, never passed through to the real call
  tripwire: false      # whether this sample is a "trap sample" (deliberately lures the LLM into the wrong move; failing is expected); default false
  environment:         # pre-eval "already provisioned" declaration; the LLM sees it and skips environment probing
    cli_available: ["log-cli"]
    files_available: ["~/.config/log-cli.json"]
    notes: "log-cli is authenticated, token in env var"
  mocks:
    - tool: Bash       # intercepted tool name: Bash / Read / Edit / Write / WebFetch / Grep / Glob, etc.
      match:
        command_glob: "*log-cli query --filter ERROR*"   # Bash uses command_glob (* wildcard, spans newlines)
      return:
        stdout: '{"count": 42}'
        exit: 0
    - tool: Read
      match:
        file_path_endswith: "tasks/state.json"   # recommended: suffix match, hits whether the LLM uses an absolute or relative path
      return: '{"status":"running"}'
    - tool: WebFetch
      match:
        url_glob: "https://internal.example.com/api/*"
      return: "ok"
Field semantics:
- mocksStrict (boolean, default true): a tool call that matches no mock is denied outright (the LLM sees a failure result). Default behavior: the omk sample generator force-writes true and the SYSTEM_PROMPT makes it explicit; for hand-written samples, the loader does not force-inject it when absent, so an old sample without the field falls back to non-strict (passes through to the real call). Strongly prefer true for new samples, so that a missing mock cannot let the eval hit a real production system.
- tripwire (boolean, default false): this sample is a "trap sample" whose prompt deliberately plants a lure that violates the rubric/skill (e.g. "I already know it's X, just use it"), testing whether the LLM blindly follows the user's wrong instruction. The LLM failing is the expected outcome; diagnostics seeing tripwire: true won't suggest changing the skill, and the UI uses a purple verdict pill to distinguish it from a bug.
- environment (object, optional): a "ready" precondition declaration for the eval environment. After reading it the LLM skips environment probing (which X / test -f Y / echo $Z) and goes straight into the workflow. Think of it as a unit test's fixture / setup. It is only a prompt hint to the LLM; it does not actually create files or export variables. The doctor health check scans it for physical-path checks (skippable with --skip-doctor).
  - cli_available (string[]): assumed already on PATH
  - files_available (string[]): assumed-existing files/scripts
  - notes (string): free-text fallback, describing credential / env-var state, etc.
- mocks (object[], optional): the tool-call interception list. At runtime, mocks are matched in array order, and the first hit returns one of return / return_file / return_seq[hitCount] as the tool_result.
  - tool: the tool name (e.g. "Bash" / "Read" / "Grep"). The special value "*" wildcards any tool name, paired with input_contains for intent-level mocking.
  - match: all entries are AND-ed:
    - file_path (string): strict equality (~ expanded). Use only when you can predict the full path (e.g. ~/.config/x.json).
    - file_path_endswith (string): suffix match: actual === suffix, or actual ends with suffix right after a path separator (/ or \). The recommended default (claude-cli internally normalizes relative paths to cwd-absolute paths, so strict equality always misses).
    - url (string) / url_glob (string): for WebFetch / WebSearch, pick one.
    - command_glob (string): for Bash, * wildcards across newlines (so the LLM's multi-line commands still hit).
    - input (object): generic deep-equal subset match (any tool_input field).
    - input_contains (string): recursively scans all string values in tool_input; a hit if any contains the substring (case-insensitive). Pair with tool: "*" for intent-level mocking: when the LLM searches code it might use Bash grep / the Grep tool / Glob / Read / Agent / any tool; use input_contains to match intent by keyword instead of enumerating tools one by one. Example: {tool: "*", match: {input_contains: "MyServiceName"}, return: "<service .../>"} hits for any tool whose input mentions MyServiceName.
  - return value: supplied through one of three fields. return is either a string or {stdout, stderr, exit} (simulates Bash); return_file points at an external file; return_seq[] is a state machine (the Nth hit on the same mock returns the Nth entry, falling back to return once exhausted).
- assertion-side mock_hit / tool_input_contains: used together with mocks. mock_hit: "Bash:2" means "the 2nd Bash mock must be hit at least once", proving the LLM reached that step. tool_input_contains: "Bash:logstore_query" checks that the Bash command string contains logstore_query.
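The matching rules above (AND-ed match keys, first hit wins, return_seq falling back to return, strict versus pass-through on a miss) can be sketched as below. Every name here is hypothetical, only three of the documented match keys are modelled, and the input keys `file_path` and `command` are assumptions about the tool_input shape. The strict flag is passed in explicitly because its effective default differs between generated and hand-written samples.

```typescript
// Sketch under assumptions: all names are hypothetical, and only three of the
// documented match keys are modelled (file_path_endswith, command_glob, input_contains).
interface Mock {
  tool: string; // "*" wildcards any tool name
  match: { file_path_endswith?: string; command_glob?: string; input_contains?: string };
  return?: string;
  return_seq?: string[];
}
type ToolInput = Record<string, unknown>;
type Outcome =
  | { kind: "mocked"; index: number; result: string }
  | { kind: "denied" }
  | { kind: "passthrough" };

// Suffix match that only fires at a path-separator boundary.
function endsWithPath(actual: string, suffix: string): boolean {
  return actual === suffix || actual.endsWith("/" + suffix) || actual.endsWith("\\" + suffix);
}

// Turn a glob into an anchored RegExp; "*" is widened so it also spans newlines.
function globToRegExp(glob: string): RegExp {
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp("^" + escaped.replace(/\*/g, "[\\s\\S]*") + "$");
}

// Recursively scan every string value for a case-insensitive substring.
function containsText(value: unknown, needle: string): boolean {
  if (typeof value === "string") return value.toLowerCase().includes(needle.toLowerCase());
  if (Array.isArray(value)) return value.some((v) => containsText(v, needle));
  if (value !== null && typeof value === "object") {
    return Object.values(value).some((v) => containsText(v, needle));
  }
  return false;
}

// Every match key that is present must pass (AND semantics).
function matches(mock: Mock, tool: string, input: ToolInput): boolean {
  if (mock.tool !== "*" && mock.tool !== tool) return false;
  const { file_path_endswith, command_glob, input_contains } = mock.match;
  const filePath = input["file_path"]; // assumed input key for Read
  const command = input["command"]; // assumed input key for Bash
  if (file_path_endswith !== undefined &&
      !(typeof filePath === "string" && endsWithPath(filePath, file_path_endswith))) return false;
  if (command_glob !== undefined &&
      !(typeof command === "string" && globToRegExp(command_glob).test(command))) return false;
  if (input_contains !== undefined && !containsText(input, input_contains)) return false;
  return true;
}

// First matching mock wins; a miss is denied only when strict is true.
function resolveToolCall(
  mocks: Mock[], hits: number[], strict: boolean, tool: string, input: ToolInput,
): Outcome {
  for (let i = 0; i < mocks.length; i++) {
    const mock = mocks[i];
    if (mock === undefined || !matches(mock, tool, input)) continue;
    const n = hits[i] ?? 0; // how many times this mock was hit before
    hits[i] = n + 1;
    // return_seq answers the Nth hit in order, then falls back to return.
    const result = mock.return_seq?.[n] ?? mock.return ?? "";
    return { kind: "mocked", index: i, result };
  }
  return strict ? { kind: "denied" } : { kind: "passthrough" };
}

const demoMocks: Mock[] = [
  { tool: "Bash", match: { command_glob: "*log-cli query --filter ERROR*" }, return: '{"count": 42}' },
  { tool: "Read", match: { file_path_endswith: "tasks/state.json" }, return: '{"status":"running"}' },
];
const demoHits: number[] = [];
const readCall = resolveToolCall(demoMocks, demoHits, true, "Read", { file_path: "/work/repo/tasks/state.json" });
const strayCall = resolveToolCall(demoMocks, demoHits, true, "Bash", { command: "rm -rf build" });
```

In this sketch the absolute path hits the Read mock through the suffix rule, while the unmatched Bash command is denied because strict mode is on; with strict off, the same call would fall through to the real tool.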
Relationship to grading / judge: the sandbox fields (mocks / environment / tripwire / mocksStrict) never enter the judge prompt — the judge sees only prompt + rubric + LLM output + trace summary. tripwire only affects the diagnostic's attribution suggestion (the tripwire_intentional rootCause); it does not affect the layered scores or the verdict.
3. Sample-design analysis features
Coverage block (rendered on the studio report page)
The studio renders each report's sample-design coverage into a summary like this:
Sample-design diagnosis — health score 87/100
Total samples: 20, flagged: 3 (errors=0, warnings=1, infos=2)
📋 Sample design coverage:
capability: componentrecognition (8) | apiselection (6) | errordiagnosis (4) | fallback (2) [20/20 declared = 100%]
difficulty: easy (5) | medium (10) | hard (5)
construct: necessity (18) | quality (2)
provenance: human (15) | llm-generated (5)
avgRubric: 45 chars
[warning] capability_thin: 1 sample(s)
⚠ s019: capability "fallback" backed by only 2 samples (threshold 4, N=20) — a single sample failure makes this dimension's conclusion unstable
[info] rubric_clarity_low: 1 sample(s)
ℹ s007: rubric is only 12 chars and has no scoring-level word — ambiguous judge standard, judge scores may be unstable
The underlying data is persisted in report.analysis.sampleQuality, which tools can read directly as JSON.
Two issue kinds
- rubric_clarity_low (severity: info): the rubric is shorter than 20 characters AND contains no scoring-level word (a 22-word zh/en list including "优秀/良好/合格/不合格/及格/满分/评分标准/至少包含" and English "excellent/good/poor/criterion/must include/at least", etc.). It's AND, not OR, to avoid false-flagging a long rubric that just doesn't use a keyword. This is a prior/static signal, complementary to the existing ambiguous_rubric (posterior/runtime, derived from judge stddev).
- capability_thin (severity: warning): a capability declared by at most max(2, totalSamples * 0.2) samples. That dimension has thin coverage, so a single sample failure makes the conclusion unstable. Small-N guard: when the total sample count is < 10 this check is skipped entirely, to avoid flagging everything in a small set.
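The two detectors can be sketched as plain functions. This is an illustration under stated assumptions: the names are hypothetical, the keyword list is only the excerpt quoted above (not the full 22-word list), and the thin-capability comparison is written as "at most the threshold" to follow the wording here, without settling how the real check treats a count exactly at the boundary.

```typescript
// Sketch: LEVEL_WORDS is only the excerpt quoted in the text; names are hypothetical.
const LEVEL_WORDS = [
  "优秀", "良好", "合格", "不合格", "及格", "满分", "评分标准", "至少包含",
  "excellent", "good", "poor", "criterion", "must include", "at least",
];

interface SampleInfo {
  sample_id: string;
  rubric: string;
  capability?: string[]; // assumed to be already normalized
}

/** Flag only when the rubric is both short AND lacks a scoring-level word. */
function rubricClarityLow(rubric: string): boolean {
  const lower = rubric.toLowerCase();
  const hasLevelWord = LEVEL_WORDS.some((w) => lower.includes(w));
  return rubric.length < 20 && !hasLevelWord;
}

/** Capabilities backed by at most max(2, N * 0.2) samples; skipped when N < 10. */
function thinCapabilities(samples: SampleInfo[]): Map<string, number> {
  const thin = new Map<string, number>();
  if (samples.length < 10) return thin; // small-N guard
  const counts = new Map<string, number>();
  for (const s of samples) {
    for (const cap of s.capability ?? []) {
      counts.set(cap, (counts.get(cap) ?? 0) + 1);
    }
  }
  const threshold = Math.max(2, samples.length * 0.2);
  counts.forEach((count, cap) => {
    if (count <= threshold) thin.set(cap, count);
  });
  return thin;
}
```

With N = 20 the threshold works out to max(2, 4) = 4, which matches the "threshold 4, N=20" figure in the coverage output.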
4. Self-check checklist: is my sample design rigorous enough?
Run through this before an eval; any "no" is a reason to stop and think:
- [ ] Construct declared: does each sample know whether it measures necessity / quality / capability?
- [ ] Capability coverage: you claim to test N capability dimensions — does the sample set actually cover N? (the studio coverage block shows the real distribution)
- [ ] Difficulty stratified: do you have easy / medium / hard, or is everything hard so noise dominates?
- [ ] Provenance transparent: is the human-curated / LLM-generated / production-trace ratio reasonable? When LLM-generated is > 50%, watch for self-instruct risk (a self-reinforcing judge-bias loop).
- [ ] Sample count: N < 5 (exploratory) / N < 20 (only large effects detectable) / N ≥ 20 (medium effects detectable); omk pre-flight already warns.
- [ ] Rubric clarity: rubric ≥ 20 characters, with at least one scoring-level word (优秀/良好/必须包含/至少包含, etc.), so the judge has an actionable level standard.
- [ ] Prompt doesn't leak the answer: terms in the prompt shouldn't directly hand over the answer the rubric/assertion expects. If the prompt must contain some keyword (a product / library / API name) and the rubric also needs that word, you've weakened the "baseline has no knowledge" assumption. That's a natural sample trade-off and should be called out explicitly at design time.
- [ ] Construct matches the experiment design: when running baseline-vs-skill, construct: necessity is the right call. When running skill-v1-vs-skill-v2, it should be construct: quality.
- [ ] Provenance guards against contamination: an LLM-generated sample may share a source with the model's own training data (self-instruct bias); after omk sample marks it 'llm-generated', a manual review pass is the v1 contamination defense.
- [ ] Capability_thin guard: when N≥10, if a capability is propped up by only 1-2 samples, that dimension's conclusion is extremely unstable. Either add samples, or drop the capability (explicitly out of test scope).
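Two of the checklist items are simple arithmetic and can be scripted before a run. The sketch below is a hypothetical helper, not part of omk: the tier labels and function names are invented, and counting undeclared samples in the denominator of the LLM-generated share is an assumption.

```typescript
// Sketch: tier labels and function names are invented for illustration.
type PowerTier = "exploratory" | "large-effects-only" | "medium-effects";

/** Map a sample count onto the three tiers from the checklist. */
function powerTier(n: number): PowerTier {
  if (n < 5) return "exploratory";
  if (n < 20) return "large-effects-only";
  return "medium-effects";
}

/** Share of samples declared 'llm-generated'; undeclared samples stay in the denominator. */
function llmGeneratedShare(provenances: (string | undefined)[]): number {
  if (provenances.length === 0) return 0;
  const generated = provenances.filter((p) => p === "llm-generated").length;
  return generated / provenances.length;
}

/** True when more than half of the set is LLM-generated. */
function selfInstructRisk(provenances: (string | undefined)[]): boolean {
  return llmGeneratedShare(provenances) > 0.5;
}

// The distribution from the coverage output: human (15) | llm-generated (5).
const provenances = [
  ...Array.from({ length: 15 }, () => "human"),
  ...Array.from({ length: 5 }, () => "llm-generated"),
];
const tier = powerTier(provenances.length); // "medium-effects"
const share = llmGeneratedShare(provenances); // 0.25
```

For that 20-sample set the share is 5 / 20 = 0.25, below the 50% line, so the self-instruct item on the checklist passes.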
5. How verdict interpretation pairs with construct
omk eval emits a verdict of PROGRESS / NOISE / REGRESS / CAUTIOUS / UNDERPOWERED / SOLO, and the verdict does not distinguish construct types — but your interpretation should:
- If the sample set is dominated by construct: necessity → PROGRESS means "the skill is necessary", and must not be read as "the skill is well written". To measure quality, follow up with a skill-v1-vs-skill-v2 run (construct: quality).
- If the sample set is dominated by construct: quality → PROGRESS / REGRESS is the genuine "skill quality comparison" signal.
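The pairing rule above can be written down as a small lookup, which is handy if you post-process reports yourself. This is a hypothetical helper rather than omk behavior: treating "dominated by" as "declared by more than half of the samples" is an assumption, and the returned sentences are illustrative wording.

```typescript
type Verdict = "PROGRESS" | "NOISE" | "REGRESS" | "CAUTIOUS" | "UNDERPOWERED" | "SOLO";

/** Construct declared by more than half of the samples, if any (majority rule is an assumption). */
function dominantConstruct(constructs: (string | undefined)[]): string | undefined {
  const counts = new Map<string, number>();
  for (const c of constructs) {
    if (c !== undefined) counts.set(c, (counts.get(c) ?? 0) + 1);
  }
  for (const [c, count] of Array.from(counts.entries())) {
    if (count * 2 > constructs.length) return c;
  }
  return undefined;
}

/** Attach the construct-aware reading to a verdict. */
function readVerdict(verdict: Verdict, dominant: string | undefined): string {
  if (dominant === "necessity" && verdict === "PROGRESS") {
    return "The skill is necessary; this says nothing about how well it is written.";
  }
  if (dominant === "quality" && (verdict === "PROGRESS" || verdict === "REGRESS")) {
    return "Genuine skill-quality comparison signal.";
  }
  return "No construct-specific reading; interpret the verdict on its own terms.";
}

// The distribution from the coverage output: necessity (18) | quality (2).
const repeat = (value: string, n: number): string[] => Array.from({ length: n }, () => value);
const declared = [...repeat("necessity", 18), ...repeat("quality", 2)];
const reading = readVerdict("PROGRESS", dominantConstruct(declared));
```

For the 18-to-2 set, the reading is the necessity one, which is the case where a follow-up skill-v1-vs-skill-v2 run is needed before claiming anything about quality.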
6. Appendix — design rationale (academic alignment & schema decisions)
This appendix lays out the reasoning behind the metadata schema: how omk's sample-design choices map to academic / industry consensus, which capabilities are deliberately out of v1 scope, and which v2 fields were considered and rejected (with reasons). It's here for the sake of full transparency — the practical guidance above is all you need to write good samples; this section answers "why it's designed this way."
6.1 Industry-consensus checklist & omk v1 coverage
As §1 argues, omk's statistical-rigor stack settles whether the conclusion is computed correctly, but the conclusion rests on the sample set, and rigor downstream cannot rescue samples that aren't rigorous themselves. The table below maps sample design to academic / industry consensus and marks omk v1's coverage.
| # | Industry gap | Academic / industry source | omk v1 status |
|---|---|---|---|
| 1 | IRT item discrimination: each item gets a (discrimination) / b (difficulty) / c (guessing) parameters; a < 0.3 is a junk item | IrtNet (2510.00844), Columbia IRT primer | out-of-scope (IRT is unreliable at N<30; left as follow-up — the v1 flat_scores heuristic already covers part of it) |
| 2 | Difficulty stratification: stratify samples by difficulty (MMLU-Pro filters difficulty via multi-model majority-correct) | MMLU-Pro | in-scope: Sample.difficulty enum + studio bucketing |
| 3 | Construct-validity trio (structural / convergent / discriminant) | Measuring what Matters (2511.04703), Measurement to Meaning (2505.10573) | in-scope: Sample.construct field (suggested: necessity / quality / capability) + verdict-interpretation callout; convergent / discriminant auto-detection is follow-up |
| 4 | Capability-matrix coverage (HELM's 16×7 matrix) | HELM (2211.09110) | partial: Sample.capability string[] field + studio coverage bucketing + capability_thin issue; detailed matrix visualization is follow-up |
| 5 | Contamination detection (canary / paraphrase / timestamp-locked) | BIG-Bench canary, LiveBench, contamination survey (2404.00699) | partial: Sample.provenance does "declarative" contamination tracking; real auto-detection is follow-up (needs an embedding model or training-data access) |
| 6 | Sample provenance / dataset card (the annotations_creators standard) | HF Dataset Cards, Synthetic Data survey (2503.14023) | in-scope: Sample.provenance enum + omk sample auto-injects 'llm-generated' |
| 7 | Adversarial / failure-driven mining (Dynabench) | Dynabench (2104.14337) | out-of-scope: omk evolve is currently one-directional; adversarial mining is follow-up |
| 8 | Production-trace natural-distribution sampling | Chatbot Arena (2403.04132) | out-of-scope: depends on external trace-system integration |
6.2 Acknowledged but not in v1 (follow-ups)
- IRT-style item discrimination (needs N≥30 + multi-model data)
- Multi-judge convergent / discriminant test (needs a ≥ 2-judge ensemble + aggregate analysis)
- Adversarial mining loop
- Production-trace natural-distribution sampling
- HTML renderer showing sample-design coverage (v1 is CLI-only)
- Evolve strategy upgrades (diversification signal / saturation-aware stop / health-weighted improvement)
- Gold-dataset auto-generation (reframed as an "annotation-process standardization" doc)
- Detailed N×D coverage-matrix visualization (v1 emits aggregate buckets + users visualize themselves)
- Contamination-detection algorithm implementation (canary string / paraphrase detection)
- User-defined rubric keyword list (diagnostics.rubricKeywords config)
6.3 v2 schema-extension candidates & rejection list
The v1 schema has only 4 fields (capability / difficulty / construct / provenance), all on the measurement-validity axis, which answers "does this sample measure what it claims to measure?" Another common community ask sits on the asset-governance axis (tags / risk_level / expected_facts / source_ids / owner), which answers "who owns this sample, where did it come from, and how important is it?" The two axes are orthogonal and don't conflict, but governance assumes measurement is already solid; v1 chose to solve measurement first. This section records the v2 candidates and the rejection list so future decisions don't re-litigate them.
v2 candidates (high-value, low-risk; add when real user demand triggers it)
- source_ids?: string[]: concrete source identifiers (issue-123 / doc:react-charts.md#line-chart / slack-thread-...). Fills the gap left by the coarse provenance enum: provenance answers "machine / human / production", source_ids answers "which specific issue / doc section". High debug value (traceable sample origin), documentation-only, never enters grading. Cost: link rot is the user's to manage.
- status?: 'active' | 'deprecated' | 'superseded': a lifecycle field. As a sample set evolves, knowing whether a sample is "primary" or "being retired" matters for verdict interpretation: a deprecated sample still runs but its Δ shouldn't count toward the headline conclusion. More important than owner.
Rejected (with reasons, to avoid re-litigating)
- tags?: string[]: semantically muddled with capability. capability is "which specific ability is tested"; the "regression / p0 / edge-case" labels that tags would carry belong either to capability (an ability dimension) or to status (lifecycle). A free-form string with no enum constraint rots into a mess. Decision: don't add; force users to express it via capability + status.
- expected_facts?: string[]: heavily overlaps rubric + assertions: contains. omk's judge already does semantic scoring; expected_facts is just another alias for the same abstraction. Decision: don't add. It would create two places to write expectations during sample design, which is prone to drift.
- owner?: string: a governance field, mismatched with omk's measurement mission. omk doesn't consume owner for routing / notification; git blame / CODEOWNERS is the better home. Decision: don't add.
- risk_level?: 'p0' | 'p1' | 'p2': raises a real question (should aggregation weight samples by risk?), but solving it would touch the verdict formula, which is a measurement invariant. Today verdict / Δ are sample-uniform; weighting them into the verdict breaks cross-version comparability. Without a consumer the field is pure noise; with one it breaks the invariant. Decision: don't add; if it's ever needed, it must be its own project, designing a weighted aggregator alongside verdict v2.
Hard constraints before adding any new field
- Must not enter the buildJudgePrompt signature (test/grading/judge-prompt-isolation.test.ts guards the regression)
- Must not enter the sampleHash computation (else it breaks cache-key cross-version comparability)
- Must not enter the verdict / Δ algorithm
- Must not semantically overlap the existing 4 fields + rubric / assertions
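One way to make the sampleHash constraint hold by construction is to serialize an explicit allow-list of fields, so that a newly added field is excluded unless someone opts it in. The sketch below illustrates that idea only. Which fields actually feed omk's sampleHash is not stated in this guide, so the allow-list (prompt, rubric, assertions) and the name `hashInput` are assumptions.

```typescript
// Sketch: which fields really feed sampleHash is not stated in this guide.
// The allow-list below (prompt, rubric, assertions) is an assumption.
interface Assertion {
  type: string;
  value?: string;
  pattern?: string;
  weight: number;
}
interface Sample {
  sample_id: string;
  prompt: string;
  rubric: string;
  assertions: Assertion[];
  capability?: string[];
  difficulty?: string;
  construct?: string;
  provenance?: string;
}

/** Serialize only allow-listed fields, so new metadata is excluded by default. */
function hashInput(sample: Sample): string {
  const { prompt, rubric, assertions } = sample;
  return JSON.stringify({ prompt, rubric, assertions });
}

const bare: Sample = {
  sample_id: "s001",
  prompt: "Draw a line chart in React",
  rubric: "Must identify the Line component",
  assertions: [{ type: "contains", value: "Line", weight: 1 }],
};
const annotated: Sample = {
  ...bare,
  capability: ["api-selection"],
  difficulty: "easy",
  construct: "necessity",
  provenance: "human",
};
// Adding metadata leaves the hash input untouched, so cached results stay comparable.
const unchanged = hashInput(bare) === hashInput(annotated); // true
```

An allow-list fails safe in the direction this section cares about: forgetting to update it keeps a new field out of the cache key, whereas a deny-list would silently let it in.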
6.4 Sources
- Holistic Evaluation of Language Models (HELM, 2211.09110)
- Measuring what Matters: Construct Validity in LLM Benchmarks (2511.04703)
- Measurement to Meaning: A Validity-Centered Framework (2505.10573)
- Position: Medical LLM Benchmarks Should Prioritize Construct Validity
- Learning Compact Representations of LLM Abilities via Item Response Theory (IrtNet, 2510.00844)
- IRT primer — Columbia Mailman
- MMLU-Pro Benchmark methodology
- Synthetic Data Generation Survey (2503.14023)
- Auto Evol-Instruct (2406.00770)
- Dynabench (2104.14337)
- Comprehensive Survey of Contamination Detection (2404.00699)
- LiveBench: Contamination-Free Benchmark
- BIG-Bench Canary in GPT-4
- How to Publish Benchmarks Without True Answers (2505.18102)
- Hugging Face Dataset Cards
- Judging LLM-as-a-Judge with MT-Bench / Chatbot Arena (2306.05685)
- Chatbot Arena Open Platform (2403.04132)