What Difference Does 99% Accuracy Make Over 95%? Not Much

Nearly every week, we read of a new accuracy record set by some algorithm on a machine learning task. Sometimes it is image classification, sometimes regression, sometimes a recommendation engine. The gap in accuracy between the best model to date and its predecessor shrinks with every new advance.


A decade ago, accuracies of 80% and more were already considered good on many problems. Nowadays, we are used to seeing 95% in the early days of a project. More sophisticated algorithms, laborious parameter tuning, far more complex models, vastly larger data sets, and, last but certainly not least, incomparably larger computing resources have driven accuracies to 99.9% and thereabouts for many problems.

Instinctively, we feel that greater accuracy is better and that all else should be subordinated to this overriding goal. This is not so. While there are a few tasks for which a change in the second decimal place of accuracy might actually matter, for most tasks this improvement will be irrelevant, especially since it usually comes at a heavy cost in at least one of the above dimensions of effort.

Furthermore, many of the very sophisticated models that achieve extremely high accuracy are quite brittle and, thus, vulnerable to unexpected data inputs. Such inputs may be rare in, or excluded from, the clean data sets used to produce the model, but strange inputs occur in real life all the time. The ability of a model to produce a reasonable output even for unusual inputs is called robustness, or graceful degradation. Simpler models are usually more robust, and models that want to survive the real world must be robust.

A real-life machine learning task sits between two great sources of uncertainty: life and the user. The data from life is inaccurate and incomplete. This source of uncertainty is usually so great that a difference in model accuracy of tenths of a percent may not even be measurable in a meaningful way. The user who sees the result of the model makes some decision on its basis, and we know from a large body of human-computer interaction research that the user cares much more about how the result is presented than about the result itself. The usability of the interface, the beauty of the graphics, and the ease of understanding and interpretation count for more than the sheer numerical value on the screen.
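To make the point about measurability concrete, here is a minimal sketch of how wide the uncertainty on a measured accuracy can be. It covers only one source of uncertainty, the sampling error from a finite test set, using the normal approximation to the binomial distribution. The test-set size of 2,000 and the two accuracy values are assumptions chosen for illustration, not figures from any real project, and inaccurate or incomplete data would widen the intervals further.

```python
import math

def accuracy_interval(accuracy, n_samples, z=1.96):
    """Approximate 95% confidence interval for an accuracy measured on
    n_samples test cases (normal approximation to the binomial)."""
    half_width = z * math.sqrt(accuracy * (1.0 - accuracy) / n_samples)
    return max(0.0, accuracy - half_width), min(1.0, accuracy + half_width)

def intervals_overlap(a, b):
    """True if two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical example: two models whose measured accuracies differ by
# three tenths of a percentage point on the same 2,000 test cases.
model_a = accuracy_interval(0.950, 2000)
model_b = accuracy_interval(0.953, 2000)

print("model A:", model_a)
print("model B:", model_b)
print("intervals overlap:", intervals_overlap(model_a, model_b))
```

Under these assumptions, each interval extends just under one percentage point to either side of the measured value, so the two intervals overlap and the test set alone cannot tell the two models apart.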

In most use cases, the human user will not be able to distinguish a model accuracy of 95% from one of 99%. Both models will be considered “good,” meaning that they solve the underlying problem that the model is supposed to solve. The extra 4 percentage points of accuracy are never seen but might have to be bought with many more resources, both initially in the model-building phase and in the ongoing model-execution phase. This is why we see so many prize-winning algorithms from Kaggle or university competitions never being used in a practical application. They have high accuracy, but this high accuracy either does not matter in practice or is too expensive (e.g., in complexity, project duration, financial cost, execution time, or computing resources) in real operations.

We must not compare models on accuracy alone, which is a simplistic criterion, but measure them along several dimensions. We will then achieve a balanced understanding of what is “good enough” for the practical purpose of the underlying task. The outcome in many practical projects is that we are done much faster and with fewer resources. Machine learning should not be perfectionism but pragmatism.
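One possible way to put the multi-dimensional comparison into practice is sketched below. Every name, threshold, and number in it is hypothetical: the `Candidate` fields, the two example models, and the "good enough" limits are assumptions made for illustration, and a real project would choose its own dimensions and limits.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    accuracy: float     # fraction correct on held-out data
    build_days: float   # effort spent in the model-building phase
    latency_ms: float   # execution time per prediction

def pick_good_enough(candidates, min_accuracy, max_latency_ms):
    """Return the cheapest-to-build candidate that meets the accuracy and
    latency requirements, or None if no candidate qualifies."""
    viable = [
        c for c in candidates
        if c.accuracy >= min_accuracy and c.latency_ms <= max_latency_ms
    ]
    if not viable:
        return None
    return min(viable, key=lambda c: c.build_days)

# Hypothetical candidates: a simple model and a far more complex one.
candidates = [
    Candidate("simple_linear", accuracy=0.95, build_days=5, latency_ms=2),
    Candidate("deep_ensemble", accuracy=0.99, build_days=60, latency_ms=40),
]

# Assumed requirement: at least 93% accuracy and at most 50 ms per prediction.
choice = pick_good_enough(candidates, min_accuracy=0.93, max_latency_ms=50)
print(choice.name)
```

Here accuracy acts as a threshold rather than as the quantity to maximize: once a model is accurate enough for the task, the choice is made on cost. With the assumed requirement, both candidates qualify and the simpler one is selected because it is cheaper to build.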

