The five layers of an LLM evaluation harness that survives a model migration

Most AI systems we replace have one eval — a notebook someone ran once. The numbers in it were good, the system shipped, and then nobody touched the eval again. Six months later, the model was upgraded, the prompt was rewritten, the retrieval pipeline was rebuilt, and the team had no idea whether any of it made the system better or worse.

The fix is structural. An LLM evaluation harness that survives a model migration has five layers, written down separately, versioned separately, and run separately.

1. Inputs

The first layer is the dataset: real inputs from production, labelled by humans, frozen at a version. Devmint's default is 200–500 examples per task. That is enough to answer the binary "is this better" questions with reasonable statistical power, and small enough that a senior engineer can read all of it in a morning.
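To make "frozen at a version" and "enough statistical power" concrete, here is a minimal sketch. The `Example` fields, the sample rows, and both function names are assumptions made for illustration, not Devmint's actual schema. The fingerprint is a content hash you could store next to every eval result. The margin uses the normal approximation for a single pass rate at its worst case (50%), so treat it as a rough guide rather than a full power analysis.

```python
import hashlib
import json
import math
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Example:
    """One labelled production input (hypothetical schema)."""
    example_id: str
    input_text: str
    label: str  # assigned by a human reviewer


def dataset_fingerprint(examples):
    """Return a SHA-256 hex digest that changes whenever any example changes."""
    # Sort by id so the fingerprint does not depend on row order.
    rows = sorted((asdict(e) for e in examples), key=lambda r: r["example_id"])
    payload = json.dumps(rows, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def margin_of_error(n, pass_rate=0.5, z=1.96):
    """Approximate 95% half-width of a pass rate measured on n examples."""
    return z * math.sqrt(pass_rate * (1.0 - pass_rate) / n)


examples = [
    Example("ex-001", "Summarise the attached policy.", "summary"),
    Example("ex-002", "Is this claim covered?", "classification"),
]
fingerprint = dataset_fingerprint(examples)

for n in (200, 500):
    print(n, round(margin_of_error(n), 3))
# prints:
# 200 0.069
# 500 0.044
```

Under these assumptions, a pass rate measured on 200 examples is pinned down to roughly plus or minus 7 points, and on 500 examples to roughly plus or minus 4.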

2. Expected outputs

For each input, we capture what a correct answer looks like, not as a single string but as a set of constraints. For a summarisation task: must mention the top three policy clauses. For a classification task: must land in one of these two categories. For a generation task: must cite at least one source from this list.
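One way to express "a set of constraints" is as small named predicates over the model output. The sketch below is illustrative only: the `Constraint` type, the helper names, and the clause, category, and source values are all hypothetical, and plain substring matching stands in for whatever matching a real spec would need.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Constraint:
    """A named check that a model output must pass (hypothetical type)."""
    name: str
    check: Callable[[str], bool]


def must_mention_all(terms):
    """Summarisation: every listed term must appear (case-insensitive)."""
    lowered = [t.lower() for t in terms]
    return Constraint(
        name="mentions all of: " + ", ".join(terms),
        check=lambda output: all(t in output.lower() for t in lowered),
    )


def must_be_one_of(labels):
    """Classification: the output must be exactly one of the allowed labels."""
    allowed = {label.lower() for label in labels}
    return Constraint(
        name="is one of: " + ", ".join(labels),
        check=lambda output: output.strip().lower() in allowed,
    )


def must_cite_at_least_one(sources):
    """Generation: at least one listed source must be cited verbatim."""
    return Constraint(
        name="cites at least one of: " + ", ".join(sources),
        check=lambda output: any(s in output for s in sources),
    )


def failed_constraints(output, constraints):
    """Return the names of the constraints the output does not satisfy."""
    return [c.name for c in constraints if not c.check(output)]


# Hypothetical spec for one summarisation input.
summary_spec = [must_mention_all(["termination", "liability", "renewal"])]
print(failed_constraints("Only termination is covered.", summary_spec))
```

Returning the names of failed constraints, rather than a bare pass/fail, keeps a red result readable when someone opens it weeks later.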

3. Scorers

A scorer takes an input, an expected-output spec, and a model output, and returns a score. We default to multiple scorers per case: at least one deterministic scorer (exact match, regex, structured-field check), at least one LLM-as-judge scorer (which needs its own eval, since the judge is itself a model), and a human spot-check on a 10% sample of cases.
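The scorer contract described above can be sketched as a plain function signature. Everything named here is an assumption for illustration: the spec keys (`expected`, `pattern`, `rubric`), the `judge` callable (a stand-in for a real LLM call, which is not shown), and the hash-based rule for choosing the 10% human sample.

```python
import hashlib
import re
from typing import Callable

# A scorer maps (input, expected-output spec, model output) to a score in [0, 1].
Scorer = Callable[[str, dict, str], float]


def exact_match(input_text, spec, output):
    """Deterministic: 1.0 if the output equals the expected string."""
    return 1.0 if output.strip() == spec["expected"].strip() else 0.0


def regex_match(input_text, spec, output):
    """Deterministic: 1.0 if the output contains a match for the pattern."""
    return 1.0 if re.search(spec["pattern"], output) else 0.0


def make_judge_scorer(judge):
    """Wrap an LLM-as-judge callable (hypothetical) as a scorer."""
    def score(input_text, spec, output):
        raw = judge(input_text, spec["rubric"], output)
        # Clamp, because a judge model may return out-of-range values.
        return min(1.0, max(0.0, float(raw)))
    return score


def in_human_sample(case_id, rate=0.10):
    """Pick a stable subset of cases for human spot-checks."""
    # Hashing the id keeps the same cases in the sample on every run.
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


# Usage with a stubbed judge that always returns an out-of-range score.
judge_scorer = make_judge_scorer(lambda text, rubric, output: 1.7)
print(exact_match("q", {"expected": "yes"}, " yes "))       # 1.0
print(judge_scorer("q", {"rubric": "Is it polite?"}, "hi"))  # 1.0
```

Selecting the human sample by hash rather than by random draw is a design choice: reviewers see the same cases release after release, so their judgements stay comparable.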

4. Harness

The harness runs every scorer against every input on every release. We hook it into CI so a PR that breaks an eval is a red build, not a guess. The harness writes results to a small Postgres table that the team can query and graph.
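A minimal sketch of that loop follows. To keep it self-contained it writes to an in-memory SQLite database from Python's standard library as a stand-in for the Postgres table; the table name, its columns, the case layout, and the threshold mechanism are all assumptions, not the actual schema.

```python
import sqlite3


def run_harness(conn, release, cases, scorers, generate):
    """Run every scorer against every case and store one row per result."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS eval_results (
               release TEXT NOT NULL,
               case_id TEXT NOT NULL,
               scorer  TEXT NOT NULL,
               score   REAL NOT NULL,
               PRIMARY KEY (release, case_id, scorer)
           )"""
    )
    for case in cases:
        output = generate(case["input"])  # one model call per case
        for name, scorer in scorers.items():
            score = scorer(case["input"], case["spec"], output)
            # INSERT OR REPLACE is SQLite syntax; re-running a release overwrites.
            conn.execute(
                "INSERT OR REPLACE INTO eval_results VALUES (?, ?, ?, ?)",
                (release, case["id"], name, score),
            )
    conn.commit()


def failing_scorers(conn, release, thresholds):
    """Return {scorer: mean score} for scorers below their threshold."""
    rows = conn.execute(
        "SELECT scorer, AVG(score) FROM eval_results "
        "WHERE release = ? GROUP BY scorer",
        (release,),
    ).fetchall()
    return {name: mean for name, mean in rows if mean < thresholds.get(name, 0.0)}


# Usage with a toy "model" that upper-cases its input.
conn = sqlite3.connect(":memory:")
cases = [
    {"id": "c1", "input": "ping", "spec": {"expected": "PING"}},
    {"id": "c2", "input": "pong", "spec": {"expected": "pong"}},
]
scorers = {"exact": lambda i, s, o: 1.0 if o == s["expected"] else 0.0}
run_harness(conn, "r1", cases, scorers, generate=str.upper)
failures = failing_scorers(conn, "r1", {"exact": 0.9})
print(failures)  # {'exact': 0.5}
```

In CI, the job would exit with a non-zero status whenever `failing_scorers` returns anything, which is what turns a broken eval into a red build. On Postgres, the upsert would be written with `INSERT ... ON CONFLICT ... DO UPDATE` instead.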

5. Replay

The fifth layer is the one teams skip — and the one that pays off when the model changes. Every production decision is logged with enough context to be re-run. When a new model lands, we re-run every past decision against the new model, score it with the existing scorers, and compare. Migrations stop being a leap of faith.
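The replay step can be sketched as follows. The `LoggedDecision` fields are a guess at what "enough context to be re-run" might include, and `new_model` and `scorer` are hypothetical callables; a real log would carry whatever the production system needs to reproduce a call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoggedDecision:
    """One production decision with the context needed to re-run it."""
    decision_id: str
    prompt: str        # the fully rendered prompt that was sent
    context: tuple     # retrieved documents, in the order they were used
    spec: dict         # expected-output spec for the existing scorers
    old_output: str    # what the old model returned at the time


def replay(decisions, new_model, scorer):
    """Re-run each logged decision on the new model and score both outputs."""
    report = []
    for d in decisions:
        new_output = new_model(d.prompt, d.context)
        old_score = scorer(d.prompt, d.spec, d.old_output)
        new_score = scorer(d.prompt, d.spec, new_output)
        report.append({
            "decision_id": d.decision_id,
            "old_score": old_score,
            "new_score": new_score,
            "delta": new_score - old_score,
        })
    return report


def summarise(report):
    """Count how many decisions improved, regressed, or stayed the same."""
    return {
        "improved": sum(1 for r in report if r["delta"] > 0),
        "regressed": sum(1 for r in report if r["delta"] < 0),
        "unchanged": sum(1 for r in report if r["delta"] == 0),
    }


# Usage with a toy scorer and a toy "new model" that always answers "approve".
log = [
    LoggedDecision("d1", "Claim A?", ("doc-1",), {"expected": "approve"}, "reject"),
    LoggedDecision("d2", "Claim B?", ("doc-2",), {"expected": "reject"}, "reject"),
    LoggedDecision("d3", "Claim C?", ("doc-3",), {"expected": "approve"}, "approve"),
]
scorer = lambda prompt, spec, output: 1.0 if output == spec["expected"] else 0.0
report = replay(log, lambda prompt, context: "approve", scorer)
print(summarise(report))  # {'improved': 1, 'regressed': 1, 'unchanged': 1}
```

The per-decision deltas matter more than the totals: the regressed rows are the ones to read before signing off on a migration.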

The five layers separate cleanly. You can swap any one of them without rewriting the others. The dataset doesn't care which model you're using. The scorers don't care which prompt you wrote. The harness doesn't care which scorers you bolted on. The replay layer doesn't care that the model has been swapped twice.

That separation is what makes an eval survive.