How to know when your machine learning model is ready for production
Machine learning (ML) teams are great at creating models that represent and predict real-world data, but effectively deploying these models is something that’s more akin to an art form than a science.
Deploying ML models requires skills that are more commonly found in software engineering and DevOps. According to VentureBeat, 87 percent of machine learning projects never make it to production. This damning statistic highlights that one of the most critical factors in making or breaking a machine learning project is the ability to collaborate and iterate as a team.
When is a model ready for production?
The goal of building a machine learning model is to solve a problem, but this is something that can only be achieved when the model is in production and in active use by consumers. It can therefore be argued that the model deployment process is just as important as model building.
However, one of the biggest challenges faced by ML teams is actually knowing when a machine learning model is ready for production. In this article, we’re going to cover the basics of how ML teams can figure this out for themselves so as to streamline the jump from development to production.
While there is no clear or definitive measure of when a machine learning model is ready to be put into production, there are plenty of considerations that you can go through for each new model to decide when it’s the right time.
Consider the goal of your model
When you are trying to figure out whether your ML model is ready for production, it’s a good idea to consider the model’s original intended goal.
This is because the use case for a machine learning model will determine how stringent the requirements for deployment should be. If, for example, the model will be used for a simple use case, such as to make suggestions to the end-user, the requirements for deployment into production will be very different from the requirements for an algorithm that’s designed to automatically make critical decisions.
A brilliant example of this can be found in autonomous driving. The Society of Automotive Engineers (SAE) has defined six levels of driving automation, which have been adopted by the U.S. Department of Transportation, ranging from level 0 (fully manual) to level 5 (fully autonomous). Depending on the type of model, it might be necessary to roll things out to production in stages according to defined thresholds, and only you can decide what’s appropriate for your model.
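To make a staged rollout concrete in code, here’s a minimal sketch assuming a simple canary-style setup in which a small, gradually growing share of traffic is routed to the new model. The stage percentages and the two predict functions are hypothetical placeholders rather than a prescribed configuration.

```python
# A minimal sketch of a staged (canary) rollout: only a small, gradually
# increasing share of traffic is served by the candidate model. The stage
# percentages and predict functions below are hypothetical placeholders.
import random

ROLLOUT_STAGES = [0.01, 0.05, 0.25, 1.0]  # share of traffic per rollout stage

def current_model_predict(features):
    return "prediction from the production model"

def candidate_model_predict(features):
    return "prediction from the candidate model"

def route_request(features, stage=0):
    """Route a request to the candidate model with the probability set by the stage."""
    if random.random() < ROLLOUT_STAGES[stage]:
        return candidate_model_predict(features)
    return current_model_predict(features)

print(route_request({"speed": 42.0}, stage=1))
```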
Look at how accurate your model is
Accuracy is vital to any machine learning model and is by far the most talked-about and strived-for metric. If a model cannot make accurate predictions, there is no point in deploying it, so ML teams strive for the best accuracy possible within the limitations of their algorithm.
Depending on a model’s use case, however, test accuracy may not be the only metric that teams should measure when determining deployment readiness. Let’s imagine for a moment that we are building a model for a healthcare setting that will help decide whether a patient needs a particular treatment. In this scenario, false positives (treating a patient who doesn’t need it) and false negatives (missing a patient who does) carry very different costs, so metrics such as precision, recall, and F1 score might be more informative than accuracy alone.
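As a rough illustration of the difference between these metrics, here is a minimal sketch using scikit-learn on a hypothetical set of labels and predictions; the numbers are made up purely to show how accuracy, precision, recall, and F1 are computed.

```python
# A minimal sketch comparing accuracy with precision, recall, and F1 using
# scikit-learn. The labels and predictions are hypothetical.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# 1 = "patient needs treatment", 0 = "patient does not need treatment"
y_true = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 0, 1, 0, 1, 0]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))  # sensitive to false positives
print("recall   :", recall_score(y_true, y_pred))     # sensitive to false negatives
print("f1       :", f1_score(y_true, y_pred))         # harmonic mean of precision and recall
```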
If you are looking at accuracy, then you will be doing this in two stages: i) during the training phase and ii) during the testing phase. To evaluate accuracy during the training phase, teams will usually set aside a portion of their dataset as validation data and use it to validate their trained model. This can be used several times to find optimal hyperparameters.
In contrast, during the testing phase, ML teams should evaluate their model’s accuracy against data that hasn’t been used during training. This is to test whether a model is capable of producing accurate predictions with new data or whether it has simply memorized the datasets that it was trained on.
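A minimal sketch of this two-stage setup, assuming scikit-learn and a synthetic dataset standing in for real data, might look like the following; the 60/20/20 split ratios are illustrative rather than a recommendation.

```python
# A minimal sketch of a train/validation/test split with scikit-learn.
# Synthetic data stands in for a real dataset; split ratios are illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Hold out 20% as the test set, untouched during training and tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Carve a validation set out of the remaining 80% for hyperparameter tuning
# (0.25 of the remaining 80% gives a 60/20/20 split overall). The validation
# set may be reused across tuning runs; the test set is evaluated only once.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42
)
```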
Define criteria in the ML pipeline
It’s important that any criteria that have been set by ML teams are codified into the machine learning pipeline. Accuracy requirements should be written as baselines so that ML teams can be certain that they’re being adhered to.
Newly deployed models should also serve as the benchmark for future models, and it’s always a good idea to compare new iterations against the production model during testing.
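What codifying such a baseline might look like is sketched below. The function name, metric, and threshold are hypothetical; a real pipeline would typically pull the production model’s score from an experiment tracker or model registry rather than a hard-coded constant.

```python
# A minimal sketch of a deployment gate in an ML pipeline. The names, metric,
# and baseline value are hypothetical.
from sklearn.metrics import f1_score

PRODUCTION_BASELINE_F1 = 0.87  # score of the model currently in production

def ready_for_production(model, X_test, y_test, baseline=PRODUCTION_BASELINE_F1):
    """Return True only if the candidate model matches or beats the production baseline."""
    candidate_f1 = f1_score(y_test, model.predict(X_test))
    return candidate_f1 >= baseline

# In a CI/CD step, a failing check blocks promotion of the candidate model:
# assert ready_for_production(candidate_model, X_test, y_test)
```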
Challenges faced by models in production
Training and deploying ML models is a huge challenge for any machine learning team. There are several reasons why getting a model into production can be difficult, from the type of data that’s available to any workarounds that are required.
It’s important for ML teams to be aware of the challenges and issues that they might come across, so here are a few of the more common ones that could crop up:
Monitoring in production
All models in production need to be regularly monitored. It’s a key part of the workflow because so many things can change when a model has been deployed.
Let’s say that you deploy a model and it performs well, but as time passes its performance gets worse. This might be because the data the model sees in production has drifted away from the data it was trained on, or it could be because of a serious underlying issue that needs to be fixed by engineers. Regular monitoring helps to identify and rectify situations like these.
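One simple monitoring check, sketched below on synthetic data, is to compare the distribution of an input feature in production against its distribution at training time with a two-sample Kolmogorov-Smirnov test; the feature shift and significance threshold are purely illustrative.

```python
# A minimal sketch of a data-drift check: compare a feature's training-time
# distribution with its production distribution. Data and threshold are illustrative.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_feature = rng.normal(loc=0.0, scale=1.0, size=5000)    # values seen in training
production_feature = rng.normal(loc=0.4, scale=1.0, size=1000)  # values seen in production

statistic, p_value = ks_2samp(training_feature, production_feature)
if p_value < 0.01:
    print(f"Possible data drift detected (KS statistic={statistic:.3f}, p={p_value:.4f})")
```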
Class imbalance
Class imbalance is when instances of one class are far more common in a dataset than instances of another. Although this is extremely common in many areas of life and is difficult to avoid, it can be quite problematic. In order to build effective ML algorithms, we need models that can perform well on every class regardless of the imbalance.
Class imbalance often appears when a model is being trained for use on data from a new domain. One way to resolve this is to collect more data from the new domain, which is costly. Alternatively, data could be taken from a similar domain with a less severe imbalance.
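As a rough sketch of what compensating for imbalance might look like in practice, the example below inspects the class distribution and uses scikit-learn’s class_weight option to up-weight the rare class during training; the synthetic 95/5 split is purely illustrative.

```python
# A minimal sketch of detecting and compensating for class imbalance with
# scikit-learn. The synthetic 95/5 imbalance is illustrative.
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.95, 0.05], random_state=42
)
print(Counter(y))  # e.g. roughly 1900 majority-class vs. 100 minority-class samples

# class_weight="balanced" up-weights errors on the rare class during training,
# so the model is not rewarded for simply predicting the majority class.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X, y)
```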
Bias
Let’s imagine we’re building a model to predict the annual income of a group defined by a specific characteristic. You gather data from lots of people and notice that the results are all over the place: some people earn $20,000 while others earn more than $80,000.
To try to capture the variability of the data you decide to bring in more data from other sources. Since you now have more data, you retrain your model and end up with a new set of values, but just like before, they’re all over the place. Pulling in data that may not represent the original group is one source of bias, while chasing every fluctuation in the data drives up variance.
The spread of values is so high, in fact, that it’s impossible to draw statistical conclusions, because you don’t have enough data points from similar people to say anything concrete about their actual incomes.
When a model has low bias and high variance, chances are it has overfit to its training data. Linear algorithms tend to have high bias but low variance, while nonlinear algorithms tend to be the opposite. Making a model more flexible decreases bias but increases variance, and constraining it decreases variance but increases bias. This trade-off can be difficult to manage.
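A small experiment can make the trade-off visible. The sketch below, on synthetic data, compares a degree-1 polynomial (high bias, low variance) with a degree-15 polynomial (low bias, high variance); the data and degrees are illustrative, and on a sample this small the flexible model will typically show a much larger gap between training and test error.

```python
# A minimal sketch of the bias-variance trade-off: a simple and a very flexible
# model fit to the same small, noisy dataset. Data and degrees are illustrative.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(40, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.3, size=40)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

for degree in (1, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    # Typically: degree 1 underfits (both errors high), degree 15 overfits
    # (training error near zero, test error much larger).
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```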
You can never be certain
Although in an ideal world ML teams would know exactly when their models are ready for production, the truth is that they can never be completely sure.
Even with a clear agreement on required KPIs, high accuracy, a strong F1 score, and thorough testing on a held-out test dataset, it’s impossible to say for certain that a model is 100 percent ready to make the transition from a training environment to deployment. A machine learning model could have been trained according to best practices and look ready to deploy without any issues, only for everything to come crashing down later on. Whatever model is put into production, monitoring it and collecting new data are what determine a production ML application’s long-term viability, and that is where ML teams should focus.
What’s also important is developing a model in the right ecosystem. Building these ecosystems is becoming more complex as ML becomes increasingly advanced, and ML teams are increasingly turning to third-party tooling and platforms like Qwak as a result.
Qwak is the full-service machine learning platform that enables teams to take their models and transform them into well-engineered products. Our cloud-based platform removes the friction from ML development and deployment while enabling fast iterations, limitless scaling, and customizable infrastructure.
Want to find out more about how Qwak could help you deploy your ML models effectively? Get in touch for your free demo!