What a two-week AI pilot actually looks like, day by day

Most AI projects die as a demo because nobody agreed what “working” means. A pilot fixes that by being short, specific, and measured. Here is the real schedule we run.

A pilot is not a smaller version of a big project. It is a single, honest experiment: pick one workflow, agree on one number, and find out in two weeks whether AI moves it. Everything below is built to protect that experiment from the two things that kill it — vague goals and a demo that only works on a slide.

Day 1–2 — Scope, and pick the one number

We sit with the people who do the work, not just the people who buy the software. We watch the actual task: where the time goes, where the mistakes happen, what gets re-done. By the end of day two we have written down one workflow and one number we will try to move — and we have measured that number today, before any AI exists. A pilot with no baseline is not a pilot, it is a hope.

Method note: baseline first. We measure the “before” number on real, recent work (say, the last 200 cases) and write down exactly how we counted it. If we can’t measure it cheaply today, it’s the wrong number, and we pick one we can.
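To make the baseline step concrete, here is a minimal sketch of what "measure the before number and write down how we counted it" could look like in Python. Everything named here is an assumption for illustration: the `Case` record, its `minutes_spent` and `needed_rework` fields, and the choice of rework rate as the one number. Your workflow will have its own fields and its own number.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class Case:
    """One unit of real work. All fields are hypothetical examples."""
    case_id: str
    minutes_spent: float   # handling time for this case
    needed_rework: bool    # True if the case had to be re-done


def baseline(cases, last_n=200):
    """Summarise the most recent `last_n` cases.

    Assumes `cases` is ordered oldest-first, so the slice takes the newest ones.
    """
    recent = cases[-last_n:]
    if not recent:
        raise ValueError("no cases to measure")

    n = len(recent)
    rework_rate = sum(c.needed_rework for c in recent) / n
    mean_minutes = sum(c.minutes_spent for c in recent) / n

    # Rough 95% margin of error for the rate (normal approximation).
    # It is a reminder that 200 cases still leaves some noise in the number.
    margin = 1.96 * sqrt(rework_rate * (1 - rework_rate) / n)

    return {
        "n": n,
        "rework_rate": rework_rate,
        "rework_margin_95": margin,
        "mean_minutes": mean_minutes,
        # The counting rule is stored with the number so it can be repeated later.
        "method": f"last {n} closed cases; rework = case re-done at least once",
    }
```

The useful part is less the arithmetic than the `method` string: the same counting rule has to be reusable, unchanged, when the number is measured again later in the pilot.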

Day 3–4 — Build the thinnest slice that could possibly work

We do not build the whole system. We build the smallest end-to-end path that touches your real data and produces a real output — even if it only handles the easy 60% of cases. The point is to learn where it breaks on your inputs, not on a clean sample. Ugly and real beats polished and fake every time.

Day 5 — First checkpoint: go or stop

End of week one, we show you the thin slice running on your data and what it’s getting right and wrong. This is the first of two honest exits. If the signal isn’t there (the data is too messy, the task too fuzzy, or the value too small), we say so, and you’ve spent one week, not two quarters. Most pilots pass this gate. The ones that don’t were the most valuable to stop.

The job of week one is to earn the right to build week two — or to fail cheaply.

Day 6–8 — Harden it on the cases that actually hurt

Now we go after the hard 40%: the edge cases, the weird formats, the exceptions your team handles on instinct. We add the checks that let the system know when it’s unsure and hand those cases back to a human instead of guessing. A system that says “I’m not sure, you look” is worth far more than one that’s confidently wrong.
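The "I’m not sure, you look" behaviour can be sketched as a small routing rule. This is one common way to do it, not a description of any particular system: it assumes the model returns a label with a confidence score between 0 and 1, and the names `Prediction`, `Decision`, `route`, and the 0.85 threshold are all hypothetical. In practice the threshold would be tuned on labelled cases from the workflow, and a raw model score is not always a trustworthy confidence.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Prediction:
    label: str
    confidence: float  # assumed to be a score in [0, 1]


@dataclass(frozen=True)
class Decision:
    route: str             # "auto" or "human"
    label: Optional[str]   # set only when the system acts on its own
    reason: str


def route(pred, known_labels, threshold=0.85):
    """Act automatically only on confident, recognised outputs.

    Anything else goes back to a person, with the reason attached.
    """
    if pred.label not in known_labels:
        return Decision("human", None, f"unrecognised label: {pred.label!r}")
    if pred.confidence < threshold:
        return Decision("human", None, f"confidence {pred.confidence:.2f} below {threshold:.2f}")
    return Decision("auto", pred.label, "confident and recognised")


# Example with made-up labels and scores.
labels = {"invoice", "receipt"}
confident = route(Prediction("invoice", 0.95), labels)   # handled automatically
unsure = route(Prediction("invoice", 0.60), labels)      # handed to a human
strange = route(Prediction("contract", 0.99), labels)    # handed to a human
```

Keeping the `reason` on every hand-off matters: the pile of cases sent back to humans becomes a list of what to harden next.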

Day 9 — Measure honestly, against the day-one number

We re-run the same measurement from day one, on a fresh sample the system has never seen, with a human checking the results. Same method, same definition, new data. Then we write down the result and what it doesn’t prove.
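As an illustration of "same method, same definition, new data", here is a rough sketch that compares a before rate and an after rate and reports how much of the difference could be noise. It assumes the one number is a rate (for example, the share of cases needing rework) and uses a simple normal-approximation interval. The function name `compare_rates` is hypothetical, and the counts in the usage lines are made up, not results from any pilot.

```python
from math import sqrt


def compare_rates(before_hits, before_n, after_hits, after_n, z=1.96):
    """Compare two rates measured the same way on two separate samples.

    Returns the difference (after minus before) with an approximate
    95% interval. If the interval includes zero, the change is not
    clearly distinguishable from sampling noise.
    """
    if before_n <= 0 or after_n <= 0:
        raise ValueError("both samples must contain at least one case")

    p_before = before_hits / before_n
    p_after = after_hits / after_n
    diff = p_after - p_before

    # Standard error of the difference between two independent proportions.
    se = sqrt(p_before * (1 - p_before) / before_n
              + p_after * (1 - p_after) / after_n)
    low, high = diff - z * se, diff + z * se

    return {
        "before": p_before,
        "after": p_after,
        "diff": diff,
        "interval_95": (low, high),
        "clear_change": low > 0 or high < 0,
    }


# Made-up counts: 50 of 200 cases needed rework before, 20 of 200 after.
result = compare_rates(50, 200, 20, 200)
```

A result like this still leaves things unproven: whether the fresh sample is representative, whether the human checker was consistent, and whether the effect holds at volume. Those belong in the written caveats alongside the number.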

Two weeks: from “we think AI could help here” to a measured number on your real work, or an honest no, at a fraction of the cost of finding out the slow way.

Day 10 — Hand it over and decide what’s next

You get three things: the working system (running on your accounts), a one-page scorecard (before, after, sample size, method, caveats), and a plain recommendation — scale it, change it, or stop. There is no lock-in and no cliff. If the number moved, scaling is a choice you make with evidence in hand. If it didn’t, you keep the code and the lesson.

Why two weeks, and not two months

Short is a feature, not a compromise. A two-week clock forces us to pick a real problem, cut the scope to something testable, and confront the data early instead of after a big build. It keeps the risk on us and the decision with you. The slow version of this — months of scoping decks and a reveal at the end — is exactly how AI ends up as an expensive demo.