Every time a new frontier model lands, the same conversation plays out in somebody's Slack channel. Usually a week or two after launch, usually between a client and their development team. Someone types a variation of the same line: let's get Claude everywhere through the stack.
"Frontier model" is the industry's shorthand for whichever model is currently the biggest and most capable on the market. The one making the headlines that week. Expensive to run, slower to answer, extremely good at hard things.
We've watched the pattern repeat for a while now. New model, bigger benchmarks, headlines, and within days a proposal lands that reaches for the new model wherever there's an existing AI touchpoint. The hope is that dropping in more capability will improve something. Occasionally it does. Often the task didn't need the extra capability in the first place.
The most recent version of the conversation was about a summary. Take some text, give me back a shorter version. Nothing exotic. The kind of job a much smaller, much cheaper model has been doing competently for a couple of years. The model being proposed for it was the frontier one.
The summary is the push pin on the noticeboard. And somebody in the room had just reached for the sledgehammer, when a thumb would have done.

In safe hands, sure, the big model will do the job. But it will also cost you ten or twenty times what you needed to spend, every time that task runs. And if the task runs a thousand times a day, that's a big number on a real invoice at the end of the month.
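The arithmetic behind that invoice is worth doing explicitly. A minimal sketch, using made-up per-run prices; the 20x gap is our assumption for illustration, not any provider's actual pricing:

```python
# Back-of-the-envelope cost comparison. The per-run figures below are
# invented for illustration, not any provider's real rates.

RUNS_PER_DAY = 1_000
DAYS_PER_MONTH = 30

frontier_cost_per_run = 0.020   # hypothetical: $0.02 per summary
small_cost_per_run = 0.001     # hypothetical: $0.001 per summary

def monthly_cost(cost_per_run: float) -> float:
    """Cost of running the task every day for a month."""
    return cost_per_run * RUNS_PER_DAY * DAYS_PER_MONTH

frontier = monthly_cost(frontier_cost_per_run)
small = monthly_cost(small_cost_per_run)

print(f"Frontier: ${frontier:,.2f}/month")   # $600.00/month
print(f"Small:    ${small:,.2f}/month")      # $30.00/month
print(f"Ratio:    {frontier / small:.0f}x")  # 20x
```

Change the two per-run numbers to whatever your providers actually charge and the shape of the decision rarely changes: the multiplier, not the absolute price, is what lands on the invoice.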
The right question
The right question isn't which model is best. It's which model is good enough, fastest, for the least money.
Three things matter, and they pull against each other. Accuracy is how often the model gives you the answer you actually wanted. Cost is what you pay each time it runs, which adds up fast if the task runs a thousand times a day. And you can't ignore speed, which is how long the user, or the system downstream, is left waiting.
The frontier model tends to win on accuracy. The small, cheap ones win on cost and speed. The interesting question is almost never "who wins all three". It's "who clears the accuracy bar I actually need, and is cheapest and fastest from there down".
Three dials, pulling against each other
An illustrative sketch of what the comparison tends to look like for a routine task like summarisation. Real numbers vary by provider and by week.
| Model class | Accuracy on a summary | Cost per run | Speed |
|---|---|---|---|
| Frontier | Excellent | High | Slow |
| Mid-range | Very good | Medium | Medium |
| Small, fast | Good enough | Low | Fast |
Read across the rows. For a summary, the accuracy bar is modest: any competent model can condense text into a shorter version without mangling it. Once the small, fast one clears that bar, the rest of the table reads itself. You pick the cheap one and move on.
The only way to know it clears the bar is to test.
How we test
A benchmark is just a test. You take a real example of the work, run it through several models side by side, and compare what comes out. We have a test suite in-house that we set up for exactly this. Feed in examples from the use case, point it at a list of candidate models (cheap ones, mid-range ones, the frontier one), and let it run them all in parallel.
Scoring used to be the awkward part. If a human has to read every output and grade it, the test doesn't scale past a small sample.
So we do what a growing body of research suggests works: use a panel of models to grade the outputs, rather than asking one large model to be the judge. Cohere's research team published a good paper on this in 2024, showing that a jury of smaller, cheaper models tends to agree with human graders as well as, or better than, a single big judge. And at a fraction of the cost. It's the same instinct as a jury in a courtroom: several ordinary opinions, averaged, beat one expert's opinion with nobody to check it.
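In code, the jury idea is just several graders and an average. A minimal sketch of the pattern, with the judge calls stubbed out; in a real harness each judge would call a small model's API with a grading prompt, and the 1-to-5 scale here is our illustrative assumption:

```python
from statistics import mean
from typing import Callable

# A judge takes (source_text, candidate_output) and returns a score.
Judge = Callable[[str, str], float]

def jury_score(source: str, output: str, judges: list[Judge]) -> float:
    """Average several cheap judges' scores instead of trusting one big one."""
    return mean(judge(source, output) for judge in judges)

# Stubs standing in for API calls to three small, cheap judge models.
# Each grades a summary from 1 (useless) to 5 (faithful and concise).
def judge_a(source: str, output: str) -> float: return 4.0
def judge_b(source: str, output: str) -> float: return 5.0
def judge_c(source: str, output: str) -> float: return 4.5

score = jury_score("long article text...", "short summary...",
                   [judge_a, judge_b, judge_c])
print(score)  # 4.5
```

The averaging is the whole trick: one eccentric grader gets outvoted, the same way one eccentric juror does.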
The output is a table much like the one above, but with your real numbers in the cells. Read across the rows. Pick the cheapest, fastest one that clears the accuracy bar you actually need.
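Reading that table programmatically is one filter and one sort. A sketch of the selection rule, with invented benchmark numbers standing in for your real ones:

```python
# Pick the cheapest, fastest model that clears the accuracy bar.
# All scores and prices below are invented for illustration.

candidates = [
    {"name": "frontier",  "accuracy": 0.97, "cost_per_run": 0.020, "latency_s": 6.0},
    {"name": "mid-range", "accuracy": 0.94, "cost_per_run": 0.005, "latency_s": 2.5},
    {"name": "small",     "accuracy": 0.91, "cost_per_run": 0.001, "latency_s": 0.8},
]

ACCURACY_BAR = 0.90  # "good enough" for a routine summary; yours may differ

def pick_model(candidates: list[dict], bar: float) -> dict:
    """Among models that clear the bar, return the cheapest (fastest as tiebreak)."""
    cleared = [m for m in candidates if m["accuracy"] >= bar]
    return min(cleared, key=lambda m: (m["cost_per_run"], m["latency_s"]))

print(pick_model(candidates, ACCURACY_BAR)["name"])  # small
```

Notice that raising the bar changes the answer: at 0.95 only the frontier model clears, and then it genuinely is the right tool. The bar is the decision; the table just makes it visible.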
Nine times out of ten, it isn't the frontier model.
There's a version of this conversation where the answer is "use a smaller model, it's fine". That's usually true, and it's also not good enough. Somebody in the room will push back, reasonably, that the cheap model might miss something the expensive one wouldn't. The only honest reply is: we don't know until we test.
So test. A day of benchmarking at the start of a project saves an argument every sprint and a surprise at the end of every quarter.
Bleeding edge is often just bleeding wallets. Most of the work a business wants to put AI on is ordinary work. Ordinary work deserves an ordinary tool, chosen on evidence, not on whichever model was trending the week the project kicked off.
Written by
James Dodd
Founder of moralai. Spent the last decade building software for people who don't describe themselves as technical.
Have a question this raised?
Talk to us, not a sales deck.
A short call, no prep needed. We'll level with you on whether there's anything worth doing here.