A question I've put to a dozen AI tools over the last year, to benchmark them and to show teams what they do: who invented the electric toaster?
One caveat before we start, because the story has moved on a little. If you run this yourself today, some of the big tools now catch the question on the first pass, because the specific hoax got famous enough to patch. Your model may hedge, or name someone else straight away. Read on anyway. The structural point survives, and for every hoax that got famous enough to patch there are a thousand that didn't. The plain first answer is still built out of whatever the internet was saying that decade, because the models were trained on the internet that believed the hoax.
First pass, plain question. The answer comes back the way it usually does. A Scottish man called Alan MacMasters, in the 1890s, working in Edinburgh. A date. A patent reference. A little biographical flourish about his father and his workshop. It sounds settled.
Second pass. Ask the model to cite its sources. Here's the bit worth understanding. Asking a modern chat model for sources tends to push it off recall (what it remembers from its training data) and onto its web-search tool (a live lookup on the open internet). Different machinery, different answer.
And so, mid-reply, it walks itself back. It finds a 2022 Guardian piece and the exposé behind it. Alan MacMasters did not invent the electric toaster. Alan MacMasters is a Scottish graphic designer who, in 2012, added himself to the Wikipedia article as a joke with a friend while at university. The edit sat there for a decade. News outlets picked it up. A BBC children's programme repeated it. At least one printed book included him in a list of great Scottish inventors. Another Wikipedia editor got suspicious of the photograph in 2022, and the hoax came apart.
The correction happens in the model's own output. It told you MacMasters a minute ago. Now it's telling you MacMasters was a prank. Same chat, same session.

Third pass, on the corrected answer. Ask the model how confident it is, on a scale of one to ten, that the electric toaster was in fact invented by one of the people it now names (Crompton's company, Strite, one of the usual candidates).
Eight. Sometimes nine. Never ten.
That's the signal worth sitting with. The answer is probably right this time. The model is telling you it still isn't sure.
What tends to happen, if you run this yourself, is a small, private version of the stages of grief. Disbelief first, when the model corrects itself on screen a minute after sounding certain. Then a quiet irritation at the tool, the kind you'd feel at a colleague who told you something with great certainty and then walked it back. Then the bargaining. If I'd asked it differently. If I'd used the paid version. If web search had been on the whole time. Those don't tend to change much.
It isn't really the model's fault. The model read what was there. A confident lie, repeated often enough, looks exactly like a confident truth by the time the training data is assembled. The machine is doing its job. The job, it turns out, is downstream of whatever the internet happened to be saying that decade.
Two things worth taking from this.
Better questions get better answers. These tools are genuinely good at what they do. A vague prompt gets a plausible answer shaped like the average of the internet. A specific one, with constraints and context and a hint of what you already know, gets something much closer to useful. Most of the bad AI output in the wild is a bad question dressed up as a tool problem.
Asking for sources and asking for confidence do different jobs. Asking for sources is what forces the model to check itself, by pushing it off recall and onto a live lookup. Asking for a confidence score is what tells you whether the checked answer is actually settled, or merely the best available guess. You need both. If you're using these tools to produce anything that goes to a customer, a regulator, a board, or a classroom, and you haven't built in a step that asks where did this come from, and how sure are we, you are printing a book whose facts rest on a university student's prank. Not metaphorically. That is the thing that happened.
Repeat a lie often enough and it becomes the truth.
The toaster is one kind of failure. Bad training data, confidently repeated. Worth naming a couple of others in the same breath, because they get lumped together and they shouldn't be. Ask a model how many r's are in "strawberry" and it will often get it wrong. That has nothing to do with what it read. It doesn't see letters the way you do, it sees tokens and patterns, and counting characters inside a word is the wrong shape of task for the thing it is. Ask it whether you should walk or drive a hundred metres to the car wash, and it will sometimes tell you to walk, missing that the whole point of the trip is to bring the car. That's a reasoning failure, also nothing to do with bad data.
The useful point is that these are different problems with different fixes. Better context helps with the first. Careful prompting and a human check help with the second and third. Domain-specific training helps in places none of the above reach. I'll get into those in later Notes. For now, the three passes are the one habit that does something useful against all three.
The worked example is MacMasters, but the pattern is the thing to take away. The three passes transfer to any claim with a whiff of trivia about it. Minor historical firsts. Regional origin stories. Disputed attributions. Anything where the internet has opinions and nobody important has bothered to correct them.
Pick one and run the same three passes. Ask the confident question. Ask for sources. Ask for a confidence score. Then check, properly, against something the model didn't write.
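If you want to make a habit of it, the three passes are small enough to script. Here is a minimal sketch in Python, assuming the OpenAI client; the model name is a placeholder, and the ask helper is mine, not anything official. One honest caveat: a bare API call may not carry the web-search tool the chat apps use, so the second pass exercises the question pattern rather than guaranteeing a live lookup.

```python
# A minimal sketch of the three-pass habit, scripted.
# Assumes the OpenAI Python client (pip install openai) and an API key
# in the OPENAI_API_KEY environment variable. The model name below is
# a placeholder; use whichever model you actually have access to.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # placeholder, swap for your model

history = []  # one running conversation, so the correction happens in context

def ask(prompt: str) -> str:
    """Send one turn and keep the whole conversation in context."""
    history.append({"role": "user", "content": prompt})
    resp = client.chat.completions.create(model=MODEL, messages=history)
    answer = resp.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    return answer

# Pass 1: the confident question.
print(ask("Who invented the electric toaster?"))

# Pass 2: push it off recall and make it check itself.
print(ask("Cite your sources for that answer."))

# Pass 3: ask how settled the corrected answer actually is.
print(ask("On a scale of one to ten, how confident are you "
          "in the answer you just gave?"))
```

The reason the script keeps the whole history in one conversation is the same reason the chat version stings: the correction has to happen in the model's own output, a minute after it sounded certain. The final check, against something the model didn't write, stays manual.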
The gap between what the tool tells you and what turns out to be true is the shape of the checking you still need to do yourself.
Written by
James Dodd
Founder of moralai. A design-led problem solver, with a photojournalism background, who has spent the last decade building software, brands and products for small businesses and the third sector.
Have a question this raised?
Talk to us, not a sales deck.
A short call, no prep needed. We'll level with you on whether there's anything worth doing here.