When a Small Language Model Beat Our LLM in Production: Right-Sizing Your Models — Avni Bhatt at AI Engineer Melbourne 2026
The team invested in a state-of-the-art large language model, reasoning that more capability would translate directly into better outcomes. The infrastructure was built to serve this model; the APIs were optimized for its specifications. Then it shipped, and the metrics came back: on every measure that mattered—latency, cost, accuracy, and reliability—a smaller, fine-tuned model outperformed it.
This reversal reveals a widespread misunderstanding about how to think about language models in production systems. The technology world has trained us to assume that bigger is better: more parameters, more training data, more capability. But production systems operate under constraints that research environments don't. Every millisecond of latency matters. Every token of processing cost compounds. The failure modes of large models become increasingly expensive to handle at scale.
The smaller model worked better not because it was fundamentally more capable, but because it was right-sized for the actual problem. It was trained on examples specific to the use case. Its outputs were constrained to a predictable format. It failed gracefully rather than confidently hallucinating. It ran in less time, cost less money, and required less infrastructure. These aren't secondary considerations in production; they're primary ones.
The decision framework for selecting models tends to focus on capability benchmarks—what percentage accuracy does the model achieve on standard tests? This matters, but it's only one dimension. A capable model that's too slow to serve within your latency budget is worse than a less capable model that meets your requirements. A model that's so expensive to run that it makes your unit economics unworkable is worse than a cheaper model that's adequate.
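One way to make this multi-dimensional framework concrete is to treat latency and cost as hard constraints and only then rank by accuracy. The sketch below is illustrative, not from the talk; the model names, budgets, and numbers are all assumptions.

```python
# Hypothetical sketch: score candidate models against hard production
# budgets, not just benchmark accuracy. All names and numbers below are
# illustrative assumptions, not measurements from the talk.

from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    accuracy: float           # task accuracy on an internal eval set (0-1)
    p95_latency_ms: float     # 95th-percentile serving latency
    cost_per_1k_calls: float  # dollars per thousand requests

def viable(c: Candidate, latency_budget_ms: float, cost_budget: float) -> bool:
    """A model is only in the running if it fits the hard constraints."""
    return (c.p95_latency_ms <= latency_budget_ms
            and c.cost_per_1k_calls <= cost_budget)

def pick(candidates, latency_budget_ms=300.0, cost_budget=5.0):
    """Among models that meet both budgets, take the most accurate one."""
    in_budget = [c for c in candidates
                 if viable(c, latency_budget_ms, cost_budget)]
    return max(in_budget, key=lambda c: c.accuracy) if in_budget else None

models = [
    Candidate("large-general", accuracy=0.91,
              p95_latency_ms=1200.0, cost_per_1k_calls=40.0),
    Candidate("small-finetuned", accuracy=0.88,
              p95_latency_ms=80.0, cost_per_1k_calls=0.9),
]
best = pick(models)
# The large model ranks higher on accuracy alone but violates both budgets,
# so the small fine-tuned model wins the selection.
```

The point of the structure is that a model which misses the latency or cost budget never reaches the accuracy comparison at all, which matches the article's claim that a too-slow or too-expensive model is worse than an adequate cheaper one.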
There's also a temporal dimension that teams often miss. Large models improve constantly—newer versions, better training, new capabilities. But the optimization process for smaller, task-specific models is much faster. If you invest in fine-tuning a smaller model for your specific use case, you can iterate on your training data, evaluate quickly, and improve rapidly. The large model stagnates while you wait for the next official release. By the time the next version arrives, your fine-tuned small model may have significantly outpaced it through targeted improvement.
The infrastructure implications are substantial. Large models require specific hardware, careful batching strategies, and specialized serving infrastructure. Smaller models run on commodity hardware. They fit in memory. They're easier to version, test, and deploy. The operational burden of running a large model in production—keeping it reliably available, managing degradation patterns, handling edge cases—can exceed the burden of managing multiple smaller, specialized models.
Reliability is another dimension where smaller models often exceed larger ones in practice. Large models have broader behavior envelopes; they can do many things, which means predicting what they'll do in edge cases is difficult. Smaller models trained on specific data fail more predictably. When they fail, teams understand why. The failure modes are constrained. This makes smaller models easier to operate and easier to integrate into larger systems.
The right-sizing decision also forces teams to think clearly about their actual requirements. Do you need a model that can handle any possible input, or a model that handles your specific distribution of inputs well? Do you need general-purpose reasoning, or highly accurate performance on a narrow task? The honest answers to these questions usually reveal that smaller models are appropriate.
This creates a paradox in the field. The flashiest AI development happens around large models—they're where research happens, where headlines live, where funding flows. But production systems increasingly run on smaller, specialized models that were painstakingly tuned for specific tasks. The unsexy engineering work of fine-tuning, evaluation, and optimization is where the value actually lives.
The most sophisticated teams now treat large models as reference implementations and starting points. They download a model, fine-tune it on their specific data, aggressively evaluate on their specific use cases, and ruthlessly prune. They measure latency and cost as primary metrics alongside accuracy. They're willing to sacrifice points on generic benchmarks in exchange for significant improvements in production performance.
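A minimal version of that "aggressively evaluate" step might report latency percentiles alongside accuracy, so speed is a first-class metric rather than an afterthought. Everything here is a stand-in: `toy_model`, the eval set, and the metric names are assumptions for illustration, not the team's actual harness.

```python
# Hypothetical sketch of an evaluation loop that treats latency as a
# primary metric alongside accuracy. The model and data are toy stand-ins.

import time
import statistics

def evaluate(model_fn, eval_set):
    """Return accuracy plus latency statistics over a labeled eval set."""
    latencies, correct = [], 0
    for prompt, expected in eval_set:
        start = time.perf_counter()
        output = model_fn(prompt)
        latencies.append((time.perf_counter() - start) * 1000.0)  # ms
        correct += int(output == expected)
    latencies.sort()
    return {
        "accuracy": correct / len(eval_set),
        "p50_ms": statistics.median(latencies),
        "p95_ms": latencies[int(0.95 * (len(latencies) - 1))],
    }

# Toy stand-in model: labels a canned intent for each prompt.
def toy_model(prompt: str) -> str:
    return "refund" if "money back" in prompt else "other"

eval_set = [
    ("I want my money back", "refund"),
    ("Where is my order?", "other"),
    ("Can I get my money back?", "refund"),
]
print(evaluate(toy_model, eval_set))
```

Running a loop like this after every fine-tuning round is what makes the fast iteration described above possible: the same report surfaces accuracy regressions and latency regressions in one place.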
This approach requires different skills than chasing the latest state-of-the-art model. Fine-tuning requires data engineering discipline. Evaluation requires careful instrumentation of production systems. Cost-aware optimization requires understanding infrastructure in detail. But these are the skills that actually separate mediocre AI systems from excellent ones.
At AI Engineer Melbourne 2026, June 3-4 in Melbourne, Australia, Avni Bhatt walks through the decision frameworks and measurement approaches that led the team to choose a smaller model and achieve superior results.