Every time a new frontier model lands, the same conversation plays out in somebody's Slack channel. Usually a week or two after launch, usually between a client and their development team. Someone types a variation of the same line: let's get Claude everywhere through the stack.
"Frontier model" is the industry's shorthand for whichever model is currently the biggest and most capable on the market. The one making the headlines that week. Expensive to run, slower to answer, extremely good at hard things.
We've watched the pattern repeat for a while now. New model, bigger benchmarks, headlines, and within days a proposal lands that reaches for the new model wherever there's an existing AI touchpoint. The hope is that dropping in more capability will improve something. Occasionally it does. Often the task didn't need the extra capability in the first place.
The most recent version of the conversation was about a summary. Take some text, give me back a shorter version. Nothing exotic. The kind of job a much smaller, much cheaper model has been doing competently for a couple of years. The model being proposed for it was the frontier one.
The summary is the push pin on the noticeboard. And somebody in the room had just reached for the sledgehammer, when a thumb would have done.

In safe hands, sure, the big model will do the job. But it will also cost you ten or twenty times what you needed to spend, every time that task runs. And if the task runs a thousand times a day, that's a big number on a real invoice at the end of the month.
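The arithmetic behind that invoice is worth doing explicitly. A minimal sketch, using made-up per-run prices; the 20x gap is our assumption for illustration, not any provider's actual pricing:

```python
# Back-of-the-envelope cost comparison. The per-run figures below are
# invented for illustration, not any provider's real rates.

RUNS_PER_DAY = 1_000
DAYS_PER_MONTH = 30

frontier_cost_per_run = 0.020   # hypothetical: $0.02 per summary
small_cost_per_run = 0.001     # hypothetical: $0.001 per summary

def monthly_cost(cost_per_run: float) -> float:
    """Cost of running the task every day for a month."""
    return cost_per_run * RUNS_PER_DAY * DAYS_PER_MONTH

frontier = monthly_cost(frontier_cost_per_run)
small = monthly_cost(small_cost_per_run)

print(f"Frontier: ${frontier:,.2f}/month")   # $600.00/month
print(f"Small:    ${small:,.2f}/month")      # $30.00/month
print(f"Ratio:    {frontier / small:.0f}x")  # 20x
```

Change the two per-run numbers to whatever your providers actually charge and the shape of the decision rarely changes: the multiplier, not the absolute price, is what lands on the invoice.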
The right question
The right question isn't which model is best. It's which model is good enough, fastest, for the least money.
Three things matter, and they pull against each other. Accuracy is how often the model gives you the answer you actually wanted. Cost is what you pay each time it runs, which adds up fast if the task runs a thousand times a day. And you can't ignore speed, which is how long the user, or the system downstream, is left waiting.
The frontier model tends to win on accuracy. The small, cheap ones win on cost and speed. The interesting question is almost never "who wins all three". It's "who clears the accuracy bar I actually need, and is cheapest and fastest from there down".
Three dials, pulling against each other
An illustrative sketch of what the comparison tends to look like for a routine task like summarisation. Real numbers vary by provider and by week.
| Model class | Accuracy on a summary | Cost per run | Speed |
|---|---|---|---|
| Frontier | Excellent | High | Slow |
| Mid-range | Very good | Medium | Medium |
| Small, fast | Good enough | Low | Fast |
Read across the rows. For a summary, the accuracy bar is modest: any competent model can condense text into a shorter version without mangling it. Once the small, fast one clears that bar, the rest of the table reads itself. You pick the cheap one and move on.
The only way to know it clears the bar is to test.
How we test
A benchmark is just a test. You take a real example of the work, run it through several models side by side, and compare what comes out. We have a test suite in-house that we set up for exactly this. Feed in examples from the use case, point it at a list of candidate models (cheap ones, mid-range ones, the frontier one), and let it run them all in parallel.
Scoring used to be the awkward part. If a human has to read every output and grade it, the test doesn't scale past a small sample.
So we do what a growing body of research suggests works: use a panel of models to grade the outputs, rather than asking one large model to be the judge. Cohere's research team published a good paper on this in 2024, showing that a jury of smaller, cheaper models tends to agree with human graders as well as, or better than, a single big judge. And at a fraction of the cost. It's the same instinct as a jury in a courtroom: several ordinary opinions, averaged, beat one expert's opinion with nobody to check it.
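In code, the jury idea is just several graders and an average. A minimal sketch of the pattern, with the judge calls stubbed out; in a real harness each judge would call a small model's API with a grading prompt, and the 1-to-5 scale here is our illustrative assumption:

```python
from statistics import mean
from typing import Callable

# A judge takes (source_text, candidate_output) and returns a score.
Judge = Callable[[str, str], float]

def jury_score(source: str, output: str, judges: list[Judge]) -> float:
    """Average several cheap judges' scores instead of trusting one big one."""
    return mean(judge(source, output) for judge in judges)

# Stubs standing in for API calls to three small, cheap judge models.
# Each grades a summary from 1 (useless) to 5 (faithful and concise).
def judge_a(source: str, output: str) -> float: return 4.0
def judge_b(source: str, output: str) -> float: return 5.0
def judge_c(source: str, output: str) -> float: return 4.5

score = jury_score("long article text...", "short summary...",
                   [judge_a, judge_b, judge_c])
print(score)  # 4.5
```

The averaging is the whole trick: one eccentric grader gets outvoted, the same way one eccentric juror does.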
The output is a table much like the one above, but with your real numbers in the cells. Read across the rows. Pick the cheapest, fastest one that clears the accuracy bar you actually need.
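Reading that table programmatically is one filter and one sort. A sketch of the selection rule, with invented benchmark numbers standing in for your real ones:

```python
# Pick the cheapest, fastest model that clears the accuracy bar.
# All scores and prices below are invented for illustration.

candidates = [
    {"name": "frontier",  "accuracy": 0.97, "cost_per_run": 0.020, "latency_s": 6.0},
    {"name": "mid-range", "accuracy": 0.94, "cost_per_run": 0.005, "latency_s": 2.5},
    {"name": "small",     "accuracy": 0.91, "cost_per_run": 0.001, "latency_s": 0.8},
]

ACCURACY_BAR = 0.90  # "good enough" for a routine summary; yours may differ

def pick_model(candidates: list[dict], bar: float) -> dict:
    """Among models that clear the bar, return the cheapest (fastest as tiebreak)."""
    cleared = [m for m in candidates if m["accuracy"] >= bar]
    return min(cleared, key=lambda m: (m["cost_per_run"], m["latency_s"]))

print(pick_model(candidates, ACCURACY_BAR)["name"])  # small
```

Notice that raising the bar changes the answer: at 0.95 only the frontier model clears, and then it genuinely is the right tool. The bar is the decision; the table just makes it visible.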
Nine times out of ten, it isn't the frontier model.
There's a version of this conversation where the answer is "use a smaller model, it's fine". That's usually true, and it's also not good enough. Somebody in the room will push back, reasonably, that the cheap model might miss something the expensive one wouldn't. The only honest reply is: we don't know until we test.
So test. A day of benchmarking at the start of a project saves an argument every sprint and a surprise at the end of every quarter.
Bleeding edge is often just bleeding wallets. Most of the work a business wants to put AI on is ordinary work. Ordinary work deserves an ordinary tool, chosen on evidence, not on whichever model was trending the week the project kicked off.
Written by
James Dodd
Founder of moralai. Spent the last decade building software for people who don't describe themselves as technical.
Have a question this raised?
Talk to us, not a sales deck.
A short call, no prep needed. We'll level with you on whether there's anything worth doing here.