The one number: how we pick the metric that decides a pilot

Every pilot we run is scored against a single number we agree on before we build anything. Picking that number well is half the work. Here’s how we do it.

If a pilot has five goals, it has none. The moment success is fuzzy, every result can be spun as a win, and nobody learns anything. So before we write a line of code, we agree on one number — the number that, if it moves, makes the project obviously worth it, and if it doesn’t, tells us to stop. Choosing it is the most important hour of the whole pilot.

What a good number looks like

We test every candidate metric against four questions:

Does the business already care about it? The number should be one your team would recognise on a Monday — hours spent, cost per case, error rate, time-to-answer. Not a metric we invented to make AI look good.
Can it move in weeks, not quarters? A pilot lasts two weeks, so the number has to respond within that window. “Revenue” is real but too slow and too noisy. “Minutes to process one invoice” moves now and points at revenue.
Can you measure it without us? If only we can compute the number, you can’t trust it. We pick something your own team can re-measure after we’re gone.
Does it translate to money, time, or risk? The number has to ladder up to something that matters: hours given back, money saved, mistakes avoided.
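
To make the fourth question concrete, here is a minimal sketch of laddering a per-item time saving up to hours and money. The function name and every figure in it (minutes saved, monthly volume, hourly cost) are hypothetical placeholders for illustration, not results from any pilot.

```python
def monthly_value(minutes_saved_per_item: float,
                  items_per_month: int,
                  hourly_cost: float) -> dict:
    """Convert a per-item time saving into hours and money per month."""
    hours_saved = minutes_saved_per_item * items_per_month / 60
    return {
        "hours_saved": round(hours_saved, 1),
        "money_saved": round(hours_saved * hourly_cost, 2),
    }


# Hypothetical inputs: 3 minutes saved per invoice,
# 2,000 invoices a month, a loaded cost of 40 per hour.
example = monthly_value(minutes_saved_per_item=3.0,
                        items_per_month=2000,
                        hourly_cost=40.0)
print(example)
```

If a candidate number cannot be pushed through a conversion this simple, that is usually a sign it does not ladder up to anything the business feels.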

Method note: the counting rules

Same definition before and after. A fresh sample the system hasn’t seen. A human checks a slice of the results. We write down the sample size and the caveats next to the number: a win on 400 cases means something; a win on 4 means nothing.
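
The counting rules can be sketched in a few lines of Python. This is an illustration under stated assumptions, not our actual tooling: the function names, the `MIN_SAMPLE` floor of 30, and the normal-approximation interval are all choices made for the example. What it shows is the habit of reporting the sample size and the caveats alongside the number, using one definition for both the before and after samples.

```python
from math import sqrt
from statistics import mean, stdev

MIN_SAMPLE = 30  # assumed floor; below this the result is flagged as anecdotal


def summarise(minutes: list[float]) -> dict:
    """Mean and standard error for one sample of per-case timings."""
    n = len(minutes)
    # A standard error needs at least two observations.
    se = stdev(minutes) / sqrt(n) if n > 1 else float("inf")
    return {"n": n, "mean": mean(minutes), "std_error": se}


def compare(before: list[float], after: list[float]) -> dict:
    """Apply the same summary to both samples and report the change."""
    b, a = summarise(before), summarise(after)
    change = a["mean"] - b["mean"]
    # Rough 95% margin on the difference (normal approximation, an assumption).
    margin = 1.96 * sqrt(b["std_error"] ** 2 + a["std_error"] ** 2)
    caveats = []
    if min(b["n"], a["n"]) < MIN_SAMPLE:
        caveats.append(f"small sample (n={min(b['n'], a['n'])})")
    if abs(change) <= margin:
        caveats.append("change is within the margin of error")
    return {"before_n": b["n"], "after_n": a["n"],
            "change": change, "margin": margin, "caveats": caveats}


# Made-up timings in minutes per case, purely for illustration.
result = compare(before=[12.0, 11.0, 13.0, 12.0],
                 after=[8.0, 7.0, 9.0, 8.0])
print(result)
```

With only four cases per side, the sketch still reports the change but attaches a small-sample caveat, which is the point of writing the sample size next to the number.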

A number that passes all four usually sounds boring. That’s the point. Boring and countable beats exciting and vague.

One number, agreed before the build, measured the same way before and after, on a sample the system has never seen.

The vanity metrics we refuse

Some numbers look like progress and measure nothing useful:

Model accuracy on a clean test set. Your data is not clean. A model that scores 95% on a tidy benchmark can fall apart on your real Tuesday inputs.
Benchmark scores. “Beats GPT-X on Y” is a press release, not a business outcome. Nobody’s afternoon got shorter because of it.
Engagement / usage. People clicking the AI thing is not people getting work done faster. We measure the work, not the tool.
A single cherry-picked example. One amazing output proves the system can be right, not that it usually is. We measure over a sample, never an anecdote.

How we count it honestly

Once the number is chosen, the integrity is in the counting. We write down, in advance, what would count as a failure. Naming the stop condition before we start is how we keep ourselves honest when we’re tempted to round a disappointing result up to a success.

An example

Take invoice processing. The tempting metric is “extraction accuracy.” The honest one is minutes of human time per invoice, measured over a real batch, because that’s what the business feels and what your team can re-measure themselves. Accuracy is a means; the minutes are the number. If the minutes drop and the error rate holds, that’s a real win you can bank — not a benchmark you have to take on faith.
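
As a rough sketch of how that rule could be written down before the pilot starts: the class, the function, and every threshold below are hypothetical and exist only to show the shape of a pre-agreed test. The pilot continues only if the minutes drop by the agreed amount and the error rate does not rise beyond the agreed tolerance; anything else is the stop condition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: the criteria cannot be edited after the fact
class SuccessCriteria:
    min_minutes_saved: float    # minutes per invoice the pilot must save
    max_error_rate_rise: float  # allowed rise in error rate (absolute, 0 to 1)


def verdict(criteria: SuccessCriteria,
            before_minutes: float, after_minutes: float,
            before_error_rate: float, after_error_rate: float) -> str:
    """Return 'continue' only if minutes drop enough and errors hold."""
    minutes_saved = before_minutes - after_minutes
    error_rise = after_error_rate - before_error_rate
    if (minutes_saved >= criteria.min_minutes_saved
            and error_rise <= criteria.max_error_rate_rise):
        return "continue"
    return "stop"


# Hypothetical thresholds agreed up front: save at least 3 minutes per
# invoice, with the error rate rising by no more than half a point.
criteria = SuccessCriteria(min_minutes_saved=3.0, max_error_rate_rise=0.005)

print(verdict(criteria, 12.0, 8.0, 0.02, 0.02))  # minutes drop, errors hold
print(verdict(criteria, 12.0, 8.0, 0.02, 0.05))  # minutes drop, errors rise
```

Freezing the criteria object mirrors the practice of fixing the threshold before any results are in, so a disappointing outcome cannot be quietly redefined as a success.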

Why this protects you

Agreeing the number up front moves the power to you. You’re not being sold a feeling about AI; you’re being shown a measurement you helped define and can reproduce. It’s also why we can offer a guarantee: when “did it work” has one clear answer, we’re comfortable tying our final invoice to it.