Three stages — doctor / eval / observe
doctor gates suite health before you spend a cent; eval scores knowledge variants through a five-layer pipeline; observe tracks runtime regressions. One framework, three jobs.
Measurement-grade evaluation & iteration for what you feed an LLM — prompt, RAG, skill, agent. Hold the model fixed, vary only the knowledge, and get statistically comparable diagnostics instead of vibes.