Two out of three is bad
A free document comparison tool had been quietly hallucinating for a year at one of our clients. Nobody had noticed, because most of the time it looked fine.
By James Dodd
Meat Loaf sang that two out of three ain't bad. He was singing about love, not software. A tool that is right two runs out of three and wrong on the third, with nothing in the workflow telling you which is which: that is bad.
We were sitting with a client's team, mapping how a piece of work actually gets done. Not the version on the org chart. The real one: who does what, in what order, using which tool, at which point in the week. These sessions are slow on purpose. You learn the most from the asides.
The aside, this time, came from the person who always did this particular job. She mentioned, in passing, that she used a free online tool to compare two versions of a document. It wasn't on any list we'd been given. It wasn't in the software catalogue. It came up the way these things usually do, halfway through a sentence about something else.
We asked if she could show us.
She pulled up the tool and walked us through a recent comparison. It looked fine. The output was a tidy list of differences between the two documents, the kind a paralegal might produce. We asked for a few more examples. She had plenty. She'd been using it for over a year.
Back at our desks, we ran three comparisons of our own, using documents we'd chosen ourselves so we knew what the right answer was.
Two looked right.
The third contained differences that were not in either document.
A tool that hallucinates every run gets caught on day one. A tool that hallucinates one run in three hides for a year.
Two out of three, it looks right. You start to trust it. By the time the third one slips past, you've stopped checking. That's how this tool had been running for over a year. Nobody checking, because nothing looked wrong.
The tool was hallucinating. That's the word the industry uses when an AI model produces something confident and plausible that is also, on inspection, not true. In this case, it had told her that one document said something the document did not actually say.
We went back and asked, gently, whether she'd ever spotted the tool getting something wrong. She thought about it. She wasn't sure. She'd sometimes had a result that looked odd and gone back to check the original, and sometimes the check was reassuring and sometimes it nudged her to look again. But there was no log. No audit. No way, a year in, to know how often the tool had been right and how often it had quietly made something up that she'd then acted on.
The documents she was comparing were policy-adjacent. Not contracts with counterparties, but the internal rules the company lived by. A flagged "change" that wasn't really there could end up actioned. A policy could get rewritten to match a phrase nobody had ever written. At best, that's wasted work. At worst, it's a policy change the company didn't mean to make, landing somewhere it creates legal exposure.
None of this was her fault. It was a tool on the open internet, it did what it said on the label, and for two runs out of three it did it well.
We extended the delivery and built a replacement.
Most of the work was done the boring way. The new tool compares documents line by line, deterministically. That word just means the output is fixed by the input: give it the same two documents twice and you get the same answer twice. No model, no guessing, no room to improvise.
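For the curious, here is roughly what that looks like. This is a minimal sketch in Python using the standard library's difflib, not the client build; the point is that there is no model anywhere in it.

```python
import difflib

def compare_documents(old_text: str, new_text: str) -> list[str]:
    """Deterministic line-by-line comparison: the same two inputs
    always produce the same output. No model, no sampling, no guessing."""
    return list(difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile="old",
        tofile="new",
        lineterm="",
    ))

# Run it twice on the same pair and the answers are identical, by construction.
diff_one = compare_documents("The policy applies to staff.", "The policy applies to all staff.")
diff_two = compare_documents("The policy applies to staff.", "The policy applies to all staff.")
assert diff_one == diff_two
```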
AI is in there, but quietly, and only as a fallback. If the two documents have been reshuffled (sections moved, clauses renumbered, a paragraph lifted to a new page), the line-by-line method can't align them on its own. At that point, and only at that point, a model is asked to match the reshuffled sections to each other. Every time the model is invoked, the tool flags the case for a human to review before anything downstream acts on it.
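Sketched out, the fallback is just a gate. The threshold, the section splitting and the model call below are all placeholders, not the client's real values, but the shape is the point: deterministic first, model only when alignment fails, and a review flag whenever the model is used.

```python
import difflib

def align_sections(old_sections, new_sections, model_align, threshold=0.6):
    """Pair up sections deterministically where possible; fall back to a model
    only when the documents have been reshuffled, and flag that for review."""
    joined_old, joined_new = "\n".join(old_sections), "\n".join(new_sections)
    similarity = difflib.SequenceMatcher(None, joined_old, joined_new).ratio()

    if similarity >= threshold:
        # Structure is close enough: pair sections in order, no model involved.
        return list(zip(old_sections, new_sections)), False

    # Sections have moved: ask the model to match them up, and mark the
    # result so a human reviews it before anything downstream acts on it.
    pairs = model_align(old_sections, new_sections)
    return pairs, True  # True = needs human review
```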
The whole thing runs inside the client's own environment. Every comparison is logged: which documents, which method, which sections were AI-aligned, who reviewed the result. If a year from now somebody asks "how often did this get something wrong", there is an answer.
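The audit trail itself is nothing clever either. Something as plain as an append-only log, one record per comparison, does the job; the field names here are illustrative, not the client's schema.

```python
import json
from datetime import datetime, timezone

def log_comparison(log_path, documents, method, ai_aligned_sections, reviewed_by):
    """Append one record per comparison, so 'how often did this get
    something wrong?' has an answer a year later."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "documents": documents,                      # which two documents were compared
        "method": method,                            # "deterministic" or "ai_fallback"
        "ai_aligned_sections": ai_aligned_sections,  # which sections the model matched, if any
        "reviewed_by": reviewed_by,                  # None until a human signs off
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
```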
It is less impressive than the free tool. It does less. That is the point.
What this job taught us
Shadow AI (AI quietly running inside tools that nobody quite registers as AI tools) doesn't show up on the software list the IT director gives you. It turns up in conversation, on the third day of a mapping session, when somebody mentions in passing how they actually do the job. You find it by asking how things get done, not by reading the catalogue. If you only audit the list, you only audit the part you already knew about.
The answer, once you find it, isn't "ban AI". A policy that says "no AI" gets the same response as a policy that says "no personal phones at work": people use it anyway, and now they don't tell you. The answer is to know where the AI is, use the deterministic method by default, invoke AI only as the fallback, keep a human in the loop where it matters, and log everything so you can tell, later, whether any of it worked.
That's duller than the tool that does magic. It's also the only version we'd trust with a year of our own work.
Written by
James Dodd
Founder of moralai. Spent the last decade building software for people who don't describe themselves as technical.
Have a question this raised?
Talk to us, not a sales deck.
A short call, no prep needed. We'll level with you on whether there's anything worth doing here.