
The three stages: doctor / eval / observe

omk is organized around three stages that map onto the lifecycle of a piece of LLM knowledge (a prompt / RAG context / skill / agent). They answer three different questions, and you usually use them in this order:

| Stage | Command | Question it answers | Software analogy |
|---|---|---|---|
| doctor | omk doctor | Is this artifact even well-formed enough to measure? | lint + typecheck + smoke test |
| eval | omk eval | Did this change actually make it better, provably? | the CI test suite |
| observe | omk observe | Does it hold up on real production traces? | production monitoring |

doctor — preflight health, before you trust any number

doctor is a health audit of a single artifact. It combines static rules (readability, metadata, dependencies, sample-contract alignment) with dimensions scored in a single LLM call (trigger boundary, doc clarity, instruction precision, …). It does not compare two versions; it tells you whether the artifact is in good enough shape to be measured at all.

It is also a gate in front of eval: omk eval runs the static doctor rules internally and refuses to run on a broken artifact, the same way CI runs lint before tests. A green doctor means "the measurement you're about to run won't be garbage-in".
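
The gating behaviour described above can be pictured with a minimal Python sketch. Everything named here (the `Artifact` fields, the two example rules, `run_eval`, `BrokenArtifactError`) is hypothetical and chosen for illustration; none of it is omk's actual rule set or API.

```python
from dataclasses import dataclass, field


@dataclass
class Artifact:
    # Hypothetical stand-in for a prompt / skill / agent definition.
    name: str
    body: str
    metadata: dict = field(default_factory=dict)


class BrokenArtifactError(Exception):
    """Raised when the static checks fail and eval must not run."""


def static_doctor_checks(artifact: Artifact) -> list[str]:
    """Return a list of problems; an empty list means 'green'."""
    problems = []
    if not artifact.body.strip():
        problems.append("body is empty")
    if "description" not in artifact.metadata:
        problems.append("metadata is missing 'description'")
    return problems


def run_eval(artifact: Artifact) -> str:
    # Gate first, the way CI runs lint before tests.
    problems = static_doctor_checks(artifact)
    if problems:
        raise BrokenArtifactError("; ".join(problems))
    return "eval would run here"  # placeholder for the actual measurement
```

The point of the design is that a failed check stops the pipeline before any score is produced, so a number never gets reported for an artifact that could not be measured meaningfully.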

→ How-to: run doctor checks

eval — the measurement core

eval is the heart of omk: an offline A/B that fixes the model and the samples, varies only the artifact (and its runtime context), and asks "did the new version beat the old one, beyond noise?". It produces the six-dimension report, the statistical machinery (bootstrap CI, length-debias, saturation, inter-judge agreement), and a one-line verdict (PROGRESS / REGRESS / CAUTIOUS / NOISE / UNDERPOWERED / SOLO) you can gate CI on.
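
To make "beyond noise" concrete, here is a minimal sketch of a paired percentile bootstrap over per-sample score differences. It is an illustration under stated assumptions, not omk's implementation: the function names, the 95% interval, the resample count, and the three-way mapping to PROGRESS / REGRESS / NOISE are all assumptions, and the remaining verdicts (CAUTIOUS, UNDERPOWERED, SOLO) are not modelled.

```python
import random
from statistics import mean


def bootstrap_ci(diffs, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-sample score differences."""
    rng = random.Random(seed)
    n = len(diffs)
    # Resample the differences with replacement and record each resample mean.
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def simple_verdict(old_scores, new_scores):
    # Paired design: same samples, same model, only the artifact differs.
    diffs = [new - old for old, new in zip(old_scores, new_scores)]
    lo, hi = bootstrap_ci(diffs)
    if lo > 0:
        return "PROGRESS"  # whole interval above zero
    if hi < 0:
        return "REGRESS"   # whole interval below zero
    return "NOISE"         # interval includes zero
```

Pairing the scores sample by sample is what "fixes the model and the samples" buys: variation between samples cancels out of the difference, so a smaller real effect becomes detectable.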

This is where omk's measurement rigor lives. Everything in architecture, statistical rigor, and the scoring pipeline is about making this one number trustworthy.

→ Concepts: how it works · scoring pipeline

→ How-to: evaluate an agent · auto-improve a skill

observe — does it hold up in production?

eval is a controlled lab experiment on a fixed sample set. observe is the opposite end: it ingests real Claude Code session traces and turns them into a skill-health report — knowledge usage, gap signals, execution stability, cost, latency. It is observation, not scoring: it tells you where the knowledge base bumped into the unknown in real usage, so your next round of samples can target it.
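
As a rough picture of what "observation, not scoring" means, the sketch below aggregates a few trace records into a per-skill summary. The record fields (`skill`, `ok`, `latency_s`, `gap`) and the report keys are hypothetical; real session traces and omk's actual report will look different.

```python
from collections import Counter
from statistics import mean

# Hypothetical trace records, invented for illustration only.
traces = [
    {"skill": "deploy", "ok": True, "latency_s": 2.0, "gap": False},
    {"skill": "deploy", "ok": False, "latency_s": 6.0, "gap": True},
    {"skill": "review", "ok": True, "latency_s": 1.0, "gap": False},
]


def health_report(records):
    """Summarize usage, gap signals, stability, and latency per skill."""
    report = {}
    usage = Counter(r["skill"] for r in records)
    for skill, count in usage.items():
        rows = [r for r in records if r["skill"] == skill]
        report[skill] = {
            "uses": count,
            "gap_signals": sum(r["gap"] for r in rows),
            "success_rate": mean(1.0 if r["ok"] else 0.0 for r in rows),
            "mean_latency_s": mean(r["latency_s"] for r in rows),
        }
    return report
```

Nothing here compares versions or assigns a quality score; the summary only counts what happened, and the rows with gap signals are the ones worth turning into new eval samples.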

→ How-to: observe production traces

How they chain

write / change an artifact
        │
        ▼
   omk doctor      → is it well-formed?  (gate)
        │ green
        ▼
   omk eval        → is the change a real improvement?  (verdict → CI gate)
        │ shipped
        ▼
   omk observe     → does it hold up in production?  → feeds new samples back into eval

The loop closes: observe surfaces real-world gaps → those become new eval samples → eval proves the next fix → doctor keeps each iteration well-formed. omk evolve automates the inner doctor → eval → rewrite loop; omk sample helps generate the test cases that feed it.