Most AI systems we replace have one eval — a notebook someone ran once. The numbers in it
were good, the system shipped, and then nobody touched the eval again. Six months later,
the model was upgraded, the prompt was rewritten, the retrieval pipeline was rebuilt, and
the team had no idea whether any of it made the system better or worse.
The fix is structural. An LLM evaluation harness that survives a model migration has
five layers, written down separately, versioned separately, and run separately.
1. Inputs
The first layer is the dataset. Real inputs from production, labelled by humans, frozen at
a version. Devmint's default is 200–500 examples per task: enough to detect a large shift in
pass rate on a binary "is this better" question, though not a change of a point or two, and
small enough that a senior engineer can read all of it in a morning.
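To put rough numbers on that dataset size, here is a small worked calculation. It assumes a single pass/fail scorer, a normal-approximation 95% interval, and the worst-case pass rate of 0.5; none of these assumptions come from the text above, and the function name is illustrative.

```python
import math


def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Half-width of a normal-approximation confidence interval for a pass rate.

    n: number of examples, p: assumed true pass rate, z: 1.96 for ~95% coverage.
    """
    return z * math.sqrt(p * (1.0 - p) / n)


for n in (200, 500):
    # Report the margin in percentage points.
    print(n, round(100 * margin_of_error(n), 1))
# prints:
# 200 6.9
# 500 4.4
```

Under these assumptions a measured pass rate is only good to about ±7 points at 200 examples and about ±4 points at 500, so the dataset can separate clearly better from clearly worse but not settle fine-grained differences. Comparing two systems on the same examples (a paired comparison) can be somewhat more sensitive than this single-rate figure suggests.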
2. Expected outputs
For each input, we capture what a correct answer looks like — not as a single string,
but as a set of constraints. For a summarisation task: must mention the top three policy
clauses. For a classification: must land in one of these two categories. For a generation:
must cite at least one source from this list.
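One way to write "a set of constraints" down is as named predicates over the model output. The sketch below is a minimal illustration, not Devmint's actual format: the `Constraint` type, the helper names, and the sample clause names, labels, and source IDs are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Constraint:
    """A named check that a model output either passes or fails."""
    name: str
    check: Callable[[str], bool]


def mentions_all(phrases: Sequence[str]) -> Constraint:
    # Summarisation-style constraint: every phrase must appear (case-insensitive).
    lowered = [p.lower() for p in phrases]
    return Constraint(
        name=f"mentions_all({', '.join(phrases)})",
        check=lambda out: all(p in out.lower() for p in lowered),
    )


def one_of(labels: Sequence[str]) -> Constraint:
    # Classification-style constraint: the output must be one of the allowed labels.
    allowed = {label.lower() for label in labels}
    return Constraint(
        name=f"one_of({', '.join(labels)})",
        check=lambda out: out.strip().lower() in allowed,
    )


def cites_any(sources: Sequence[str]) -> Constraint:
    # Generation-style constraint: at least one listed source ID must appear.
    return Constraint(
        name=f"cites_any({', '.join(sources)})",
        check=lambda out: any(s in out for s in sources),
    )


def failed_constraints(output: str, constraints: Sequence[Constraint]) -> List[str]:
    """Return the names of the constraints the output does not satisfy."""
    return [c.name for c in constraints if not c.check(output)]


# Hypothetical spec for one summarisation example.
summary_spec = [mentions_all(["termination", "liability", "renewal"])]
print(failed_constraints("Covers termination only.", summary_spec))
```

Because each constraint carries its own name, a failing case reports which requirement was missed instead of a bare "wrong answer".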
3. Scorers
A scorer takes an input, an expected-output spec, and a model output, and returns a score.
We default to multiple scorers per case — at least one deterministic (exact match,
regex, structured field), at least one LLM-as-judge (with its own eval), and at least one
human spot-check that runs on a 10% sample.
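A rough sketch of the three scorer kinds, under stated assumptions: the expected-output spec is a plain dict with hypothetical keys (`answer`, `pattern`, `rubric`), the LLM judge is injected as an ordinary callable so that no particular vendor API is implied, and the 10% human sample is chosen by hashing the case ID so the same cases are picked on every run.

```python
import hashlib
import re
from typing import Callable

# (input, expected-output spec, model output) -> score in [0, 1]
Scorer = Callable[[str, dict, str], float]


def exact_match(input_text: str, expected: dict, output: str) -> float:
    # Deterministic scorer: whitespace-insensitive exact match.
    return 1.0 if output.strip() == expected["answer"].strip() else 0.0


def regex_match(input_text: str, expected: dict, output: str) -> float:
    # Deterministic scorer: the output must contain the expected pattern.
    return 1.0 if re.search(expected["pattern"], output) else 0.0


def make_judge_scorer(judge: Callable[[str], float]) -> Scorer:
    """Wrap any callable that maps a grading prompt to a number into a Scorer."""
    def score(input_text: str, expected: dict, output: str) -> float:
        prompt = (
            f"Input:\n{input_text}\n\n"
            f"Rubric:\n{expected['rubric']}\n\n"
            f"Output:\n{output}\n\n"
            "Score from 0 to 1."
        )
        # Clamp so a misbehaving judge cannot push scores out of range.
        return max(0.0, min(1.0, judge(prompt)))
    return score


def needs_human_review(case_id: str, sample_rate: float = 0.10) -> bool:
    """Deterministically select roughly sample_rate of cases for a human spot-check."""
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate


# A stub judge stands in for a real model call.
judge_scorer = make_judge_scorer(lambda prompt: 0.8)
print(judge_scorer("2+2?", {"rubric": "Answer must be 4."}, "4"))
```

Hash-based sampling is one design choice among several; its advantage is that reviewers see a stable subset across releases, which keeps human scores comparable over time.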
4. Harness
The harness runs every automated scorer against every input on every release (the human spot-check stays on its 10% sample). We hook it into CI
so a PR that breaks an eval is a red build, not a guess. The harness writes results to
a small Postgres table that the team can query and graph.
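The harness loop itself can be small. The sketch below is illustrative only: it uses Python's built-in `sqlite3` as a stand-in for Postgres so that it runs without a server, and the `eval_results` table, its columns, and the per-scorer thresholds are assumptions rather than the schema described above.

```python
import sqlite3
from typing import Callable, Dict, Iterable, Tuple

Case = Tuple[str, str, dict]            # (case_id, input, expected-output spec)
Scorer = Callable[[str, dict, str], float]


def run_harness(
    conn: sqlite3.Connection,
    release: str,
    cases: Iterable[Case],
    model: Callable[[str], str],
    scorers: Dict[str, Scorer],
) -> Dict[str, float]:
    """Run every scorer on every case, store each score, return mean score per scorer."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS eval_results "
        "(release TEXT, case_id TEXT, scorer TEXT, score REAL)"
    )
    totals: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for case_id, input_text, expected in cases:
        output = model(input_text)  # one model call per case, shared by all scorers
        for name, scorer in scorers.items():
            score = scorer(input_text, expected, output)
            conn.execute(
                "INSERT INTO eval_results VALUES (?, ?, ?, ?)",
                (release, case_id, name, score),
            )
            totals[name] = totals.get(name, 0.0) + score
            counts[name] = counts.get(name, 0) + 1
    conn.commit()
    return {name: totals[name] / counts[name] for name in totals}


def gate(means: Dict[str, float], thresholds: Dict[str, float]) -> int:
    """Return a process exit code: 0 if every scorer meets its threshold, else 1."""
    failures = [name for name, t in thresholds.items() if means.get(name, 0.0) < t]
    return 1 if failures else 0
```

In CI, the job would call `run_harness`, then pass the result of `gate` to `sys.exit`, so a drop below any threshold fails the build. Storing one row per (release, case, scorer) keeps the table easy to query for per-case regressions as well as averages.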
5. Replay
The fifth layer is the one teams skip — and the one that pays off when the model changes.
Every production decision is logged with enough context to be re-run. When a new
model lands, we re-run every past decision against the new model, score it with the
existing scorers, and compare. Migrations stop being a leap of faith.
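A minimal sketch of the replay idea follows. It assumes each production decision is logged as one JSON line holding the fully rendered prompt (retrieved context included), the model version, and the original output, and that the scorer can grade an output from the prompt alone. The `Decision` fields and function names are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Decision:
    decision_id: str
    model_version: str
    prompt: str   # fully rendered prompt, so the call can be re-issued as-is
    output: str   # what the old model returned in production


def to_log_line(decision: Decision) -> str:
    return json.dumps(asdict(decision), sort_keys=True)


def from_log_line(line: str) -> Decision:
    return Decision(**json.loads(line))


def replay(
    decisions: Iterable[Decision],
    new_model: Callable[[str], str],
    scorer: Callable[[str, str], float],
) -> List[Dict[str, object]]:
    """Re-run each logged prompt on the new model and score old vs new output."""
    rows: List[Dict[str, object]] = []
    for d in decisions:
        new_output = new_model(d.prompt)
        old_score = scorer(d.prompt, d.output)
        new_score = scorer(d.prompt, new_output)
        rows.append({
            "decision_id": d.decision_id,
            "old": old_score,
            "new": new_score,
            "delta": new_score - old_score,
        })
    return rows


def summarize(rows: List[Dict[str, object]]) -> Dict[str, int]:
    """Count how many decisions improved, regressed, or stayed the same."""
    return {
        "improved": sum(1 for r in rows if r["delta"] > 0),
        "regressed": sum(1 for r in rows if r["delta"] < 0),
        "unchanged": sum(1 for r in rows if r["delta"] == 0),
    }
```

The per-decision rows matter more than the summary: the regressed IDs are the cases to read by hand before approving a migration.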
The five layers separate cleanly. You can swap any one of them without rewriting the
others. The dataset doesn't care which model you're using. The scorers don't care which
prompt you wrote. The harness doesn't care which scorers you bolted on. The replay layer
doesn't care that the model has been swapped twice.
That separation is what makes an eval survive.