Model Deployment: Strategies, Best Practices, and Use Cases
In machine learning, much of the focus and attention is given to the practices and experimentation strategies surrounding training a model. Model training requires identifying suitable input datasets, transforming and preparing data, selecting or designing the appropriate algorithm or framework, tuning hyperparameters, tracking experiments, and performing careful analysis, testing, and evaluation to determine the best variation of your model.
Training a machine learning model is a challenging task that receives ample attention and consideration in tech discussions and blog posts, but it only solves half the problem. After we have determined that our model is capable of making accurate predictions, we need to decide how we’re actually going to use the model. How are we going to integrate this model with our existing applications and services so that it can provide real value to our product or organization? In order to make the appropriate plans and decisions, we need to ask ourselves key questions:
- How quickly will our predictions be served?
- How will we update our model?
- How will we evaluate our model against new iterations or versions?
- How will we ensure that our model is continuing to meet performance expectations over time?
While it’s easy to think of machine learning models as data experiments that live in a one-off notebook, in order to extract the full predictive value of a machine learning model, we need to treat it like any other production-grade service. In this blog post, we will discuss and answer the following questions:
- What is model deployment?
- What are the different deployment serving methods, and which should you choose?
  - Batch
  - Real Time
  - Streaming
- How can you iterate on existing models and compare performance?
  - Shadow Deployments
  - A/B Testing
- How should you monitor your inference service in production?
- How should you scale your application?
What is Model Deployment?
Model deployment, also known as inference, marks the transition of a machine learning model from the development phase to its operational use in real-world applications. In essence, it involves making the predictive capabilities of the trained model available to end-users or other systems in a production environment.
During deployment, the model is integrated into existing software services, where it receives input data - either in the form of a request, batch input source, or data stream - processes it through the learned algorithms, and generates predictions or classifications as output. This process is crucial for realizing the value of machine learning models, as it enables organizations to automate decision-making, enhance efficiency, and drive innovation across various domains, from healthcare and finance to e-commerce and autonomous vehicles.
What are the Types of Model Deployment?
When preparing your inference deployment, one of the first decisions you’ll need to make is the method of delivering the predictions. Do your predictions need to be served in real time for instant decision making? Are your predictions dependent on some user action? Does your organization have an event stream or event bus that needs to receive the predictions for downstream applications? How many resources are you willing to dedicate to your inference service?
It’s necessary to carefully consider the questions above, as they will guide you to the appropriate deployment method. Let’s take a deeper look at the three major deployment methods for serving inference: batch, real-time, and streaming.
Batch Inference
Batch inference is a method of serving predictions or classifications on large volumes of data in bulk, typically offline and at scheduled intervals. In this deployment mode, input data is collected over a period of time - usually in a data lake or data warehouse such as object storage, a relational database, or Snowflake - ingested as a batch, and then sent into the deployed machine learning model to generate predictions or inferences.
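To make the pattern concrete, here is a minimal sketch of a scheduled batch scoring job in Python. The file paths, the `id` column, and the scikit-learn-style model artifact are assumptions for illustration, not a prescribed setup.

```python
# Minimal batch scoring sketch: read a batch of records, score them in chunks,
# and write predictions back to storage. File names and columns are placeholders.
import joblib
import pandas as pd

def run_batch_job(input_path: str, output_path: str, chunk_size: int = 10_000) -> None:
    model = joblib.load("model.joblib")  # previously trained model artifact (assumed path)
    results = []
    # Process the batch in chunks so memory stays bounded on large inputs
    for chunk in pd.read_csv(input_path, chunksize=chunk_size):
        features = chunk.drop(columns=["id"])
        chunk["prediction"] = model.predict(features)
        results.append(chunk[["id", "prediction"]])
    # Persist all predictions in one output file for downstream consumers
    pd.concat(results).to_csv(output_path, index=False)

if __name__ == "__main__":
    # Typically triggered by a scheduler (cron, Airflow, etc.) rather than by hand
    run_batch_job("daily_input.csv", "daily_predictions.csv")
```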
Batch inference deployments are commonly used in scenarios where latency requirements for providing predictions are not strict, such as generating daily reports, analyzing historical trends, or conducting periodic batch processing tasks. This approach offers advantages in terms of resource optimization, as a batch inference model will scale up during execution time and immediately scale down after the batch is processed. However, it may introduce latency in obtaining predictions, making it less suitable for time-sensitive applications.
Real Time Inference
Real-time inference deployment involves executing predictions or classifications on incoming data with minimal latency, enabling immediate responses to user queries or events. Usually in the form of a REST endpoint or gRPC service, the deployed machine learning model processes data as it arrives, providing instant predictions or inferences to support time-sensitive applications.
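As an illustration of the REST flavor of this pattern, here is a minimal FastAPI endpoint wrapping a pre-trained model. The model path, feature shape, and route name are placeholders; a real service would add validation, batching, and authentication.

```python
# Minimal real-time inference sketch using FastAPI.
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # load once at startup, not per request (assumed path)

class PredictionRequest(BaseModel):
    features: list[float]            # flat feature vector for a single example

class PredictionResponse(BaseModel):
    prediction: float

@app.post("/predict", response_model=PredictionResponse)
def predict(request: PredictionRequest) -> PredictionResponse:
    # The model expects a 2D array: one row per example
    prediction = model.predict([request.features])[0]
    return PredictionResponse(prediction=float(prediction))

# Run with: uvicorn app:app --host 0.0.0.0 --port 8000
```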
Real-time inference deployments are essential for applications where immediate decision-making is critical, such as fraud detection, recommendation systems, and autonomous driving. This approach requires high-performance computing infrastructure capable of handling rapid data processing and low-latency response times. While real-time inference deployments offer unparalleled responsiveness, they do require the management and maintenance of more infrastructure.
They constantly use resources, need to effectively respond and scale to changes in traffic velocity, and require configurations for networking, endpoint definition, and security. While real-time inference services can certainly be costly, they can often be justified if the real-time benefit provided to the user outweighs the cost of infrastructure.
Streaming Inference
Streaming inference deployment involves the continuous processing of data streams in real-time, where input data is ingested from sources such as Apache Kafka, processed by a deployed machine learning model, and the resulting predictions or inferences are outputted back into Kafka for further downstream processing or action. This deployment mode is particularly suitable for applications requiring extremely low-latency, high-throughput, and near-real-time decision-making, such as clickstream analysis, anomaly detection, and IoT sensor data processing.
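Below is a minimal sketch of that loop using the kafka-python client: consume feature events, score them, and publish predictions to another topic. The topic names, broker address, and message schema are assumptions for illustration.

```python
# Minimal streaming inference sketch with kafka-python.
import json
import joblib
from kafka import KafkaConsumer, KafkaProducer

model = joblib.load("model.joblib")  # assumed model artifact

consumer = KafkaConsumer(
    "feature-events",                         # assumed input topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Each consumed event is scored immediately and the result is written back
# to Kafka for downstream consumers (alerts, dashboards, other services).
for message in consumer:
    features = message.value["features"]
    prediction = float(model.predict([features])[0])
    producer.send("prediction-events", {"id": message.value["id"], "prediction": prediction})
```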
Leveraging streaming inference deployments enables organizations to achieve timely insights and responses to evolving data, facilitating dynamic decision-making and automation. However, implementing streaming clusters is no easy feat. Managing Kafka clusters at the organizational level requires engineering expertise, time, and a dedication to understanding streaming architecture. Even with managed services like MSK or Confluent, streaming architecture is less likely to be implemented for a one-off use case in a project or model.
That being said, despite the complexities involved, streaming inference deployments offer unparalleled agility and responsiveness, empowering organizations to harness the full potential of real-time data analytics for driving innovation and delivering value.
Which Should You Choose?
When deciding on an inference framework, the first decision you need to make is between batch and real-time prediction serving. There are two key questions you should consider: can your model serve predictions for each request or batch of requests in a reasonable time, and is the value an immediate prediction provides to your users or downstream customers greater than the cost of continuously running infrastructure to maintain the prediction service?
If the answer to both of these questions is yes, it’s worth considering a real-time inference architecture. If the answer is no, batch is probably a more suitable and reasonable choice. Batch executions can also be flexible; they don’t need to be executed on a schedule. By utilizing ad-hoc batch executions and some form of pipeline architecture, you could return predictions to your users or customers in near real time, at a fraction of the cost and compute.
Now, if you’ve decided to go beyond batch, the next decision to be made is between real-time and streaming inference serving. This is a more difficult decision that, again, ultimately comes down to your latency and throughput requirements. If the lowest possible latency and highest performance are your ultimate goals, streaming is probably the best bet.
It will require a greater investment in technology and resources, but it is definitely the best approach for the fastest possible inference serving. On the other hand, if low latency - often in the single-digit millisecond range - is acceptable for your project, real-time inference serving will provide great value in terms of result delivery, cost optimization, and engineering resources. Real-time inference is much easier to stand up, maintain, and iterate on.
Ultimately, there is no all-encompassing answer as to which inference method you should choose. The answer lies in understanding your data, model, and specific user requirements.
Iteration, Testing, and Performance Comparisons
Now that you’ve selected a method of inference deployment, you have your model running and serving predictions. But a few months down the line, a new feature is added or an upgrade to your model framework is released, and you need to deploy a change to your model.
This introduces a number of new complexities. How will you roll out your model so that traffic isn’t disrupted on your service? How will you ensure that your new model performs the same or better against test cases? How will you conduct A/B tests between your new and existing models to see how each performs under controlled feature conditions? Addressing these concerns will require you to investigate and implement CI/CD strategies and testing frameworks that will strengthen the effectiveness of your models and the reliability of your processes. Let’s discuss a few strategies.
Blue/Green Deployment
Blue-green deployment is a deployment strategy commonly used in software development to minimize downtime and risk when deploying updates or changes to a system. In this approach, two identical production environments, referred to as "blue" and "green," are maintained concurrently. At any given time, one environment serves as the active production environment (e.g., blue), while the other remains inactive (e.g., green), ready to be updated or rolled back as needed.
When a new version of the system is ready for deployment, it is first deployed to the inactive environment (green). Once the deployment is verified and validated, traffic is gradually routed from the active environment (blue) to the newly deployed environment (green), ensuring a seamless transition with minimal disruption. If any issues arise during the deployment process, traffic can quickly be redirected back to the stable environment (blue) while the issues are addressed.
In the context of machine learning inference, blue-green deployment can be leveraged to seamlessly roll out updates or improvements to machine learning models while minimizing disruptions to production services. By maintaining two identical inference environments, organizations can deploy new versions of machine learning models to the inactive environment (green) and gradually shift traffic from the active environment (blue) to the updated environment (green) once the new model’s performance has been validated.
This approach ensures that users experience minimal downtime and uninterrupted service while benefiting from the enhanced capabilities of the updated model. Additionally, in the event of unexpected issues or degradation in model performance, traffic can easily be redirected back to the stable environment (blue) until the issues are resolved.
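In practice the traffic split is usually handled by a load balancer, service mesh, or the serving platform itself, but the core idea can be sketched in a few lines of Python. The endpoint URLs and the weight schedule below are hypothetical.

```python
# Illustrative sketch of gradually shifting traffic from a "blue" model
# endpoint to a "green" one via a weighted coin flip per request.
import random
import requests

BLUE_URL = "http://model-blue:8000/predict"    # hypothetical stable endpoint
GREEN_URL = "http://model-green:8000/predict"  # hypothetical new endpoint

def route_request(payload: dict, green_weight: float) -> dict:
    """Send the request to green with probability green_weight, else blue."""
    target = GREEN_URL if random.random() < green_weight else BLUE_URL
    response = requests.post(target, json=payload, timeout=2)
    response.raise_for_status()
    return response.json()

# Roll out gradually: start at 10% green, increase as validation checks pass,
# and drop back to 0.0 instantly if the green deployment misbehaves.
prediction = route_request({"features": [0.4, 1.2, 3.1]}, green_weight=0.1)
```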
Shadow Deployment
Shadow deployment is a deployment technique that involves running a development replica of a model alongside the existing production environment without affecting live traffic. In shadow deployments, a copy of the production model’s input traffic is sent to the replica, but instead of serving the replica’s predictions or responses to users, the system simply observes and records its outputs. This allows organizations to evaluate the performance and behavior of the new version in a real-world environment before fully committing to its deployment.
By comparing the outputs of the new version with those of the existing production system, organizations can assess the impact of the changes and identify any discrepancies or issues that may arise. Shadow deployments are particularly valuable for machine learning inference, as they enable organizations to test new models or algorithm changes in a safe and controlled manner, without risking disruptions to the production environment.
Additionally, shadow deployments provide valuable insights into the potential performance improvements or regressions introduced by the new version, helping organizations to make informed decisions about when and how to transition to the updated system. Overall, shadow deployments offer a powerful mechanism for validating changes and mitigating risks in machine learning inference deployments, ultimately contributing to more robust and reliable production systems.
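A minimal sketch of the mirroring logic might look like the following, assuming hypothetical primary and shadow endpoints. The key property is that the shadow call happens off the critical path and its result is only logged, never returned to the caller.

```python
# Shadow-deployment sketch: the primary model serves the response while the
# same payload is mirrored to a shadow model whose output is only logged.
import logging
import threading
import requests

PRIMARY_URL = "http://model-prod:8000/predict"    # hypothetical production endpoint
SHADOW_URL = "http://model-shadow:8000/predict"   # hypothetical shadow endpoint

logger = logging.getLogger("shadow")

def _score_shadow(payload: dict, primary_prediction: dict) -> None:
    try:
        shadow_prediction = requests.post(SHADOW_URL, json=payload, timeout=2).json()
        # Record both outputs so they can be compared offline
        logger.info("shadow_compare payload=%s primary=%s shadow=%s",
                    payload, primary_prediction, shadow_prediction)
    except requests.RequestException:
        # Shadow failures must never affect live traffic
        logger.warning("shadow request failed", exc_info=True)

def predict(payload: dict) -> dict:
    primary_prediction = requests.post(PRIMARY_URL, json=payload, timeout=2).json()
    # Fire-and-forget: the user only ever sees the primary model's answer
    threading.Thread(target=_score_shadow, args=(payload, primary_prediction), daemon=True).start()
    return primary_prediction
```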
Multi-Armed Bandits
For a more dynamic testing strategy, multi-armed bandit algorithms offer a way to manage and optimize the selection of models or algorithms for making predictions or classifications in real time. Instead of relying on fixed allocation strategies or manually tuning model serving configurations, multi-armed bandits continuously explore and exploit different inference options to maximize performance metrics such as accuracy, latency, or user satisfaction.
By dynamically allocating traffic to different versions of deployed models based on their observed performance, multi-armed bandits enable organizations to adaptively optimize resource utilization and ensure that the most effective models are used to serve incoming inference requests. This adaptive deployment strategy is particularly valuable in environments where the performance of deployed models may vary over time due to changes in data distributions, feature drift, user behavior, or other external factors.
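As a simple illustration, here is a sketch of an epsilon-greedy bandit router; production systems often use more sophisticated strategies such as Thompson sampling, and the variant names and reward signal below are placeholders.

```python
# Epsilon-greedy bandit sketch for routing traffic between model variants.
import random
from collections import defaultdict

class EpsilonGreedyRouter:
    def __init__(self, variants: list[str], epsilon: float = 0.1):
        self.variants = variants
        self.epsilon = epsilon
        self.counts = defaultdict(int)     # times each variant was chosen
        self.rewards = defaultdict(float)  # cumulative reward per variant

    def choose(self) -> str:
        # Explore with probability epsilon, otherwise exploit the best variant so far
        if random.random() < self.epsilon or not self.counts:
            return random.choice(self.variants)
        return max(self.variants, key=lambda v: self.rewards[v] / max(self.counts[v], 1))

    def record(self, variant: str, reward: float) -> None:
        self.counts[variant] += 1
        self.rewards[variant] += reward

router = EpsilonGreedyRouter(["model-v1", "model-v2"])
variant = router.choose()           # route the next request to this variant
router.record(variant, reward=1.0)  # later, feed back the observed outcome (click, conversion, etc.)
```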
However, this approach does require storing and accessing the model’s real-time prediction outputs in a fast, consistent manner. The data needs to be stored in a way that allows easy analysis of both the features and the model’s predictions over time, as well as the ability to easily compare these outcomes across different model variations.
For each request, you would need to sync the input request features, any additional variables that may have been computed pre-inference, as well as the actual output prediction. Automatically syncing this data to a data lake or query tool like Snowflake or BigQuery would facilitate making the dynamic evaluations that influence traffic strategy.
CI/CD
With CI/CD, changes to machine learning models, whether it’s tweaking hyperparameters, fine-tuning architectures, or incorporating new training data, can be automatically tested, validated, and deployed in a controlled and reproducible manner. Leveraging CI/CD pipelines, organizations can automate the end-to-end process of building, testing, and deploying machine learning models, reducing manual intervention and human error while accelerating time-to-market for model updates.
Platforms like GitHub Actions provide powerful tools for implementing CI/CD workflows for machine learning inference, offering seamless integration with version control systems like Git. By utilizing GitHub Actions, teams can automate tasks such as model training, evaluation, and deployment, enabling continuous monitoring and validation of model performance.
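One common building block in such a pipeline is a validation gate that blocks deployment if the candidate model regresses against the current production model. The sketch below shows the kind of script a CI job might invoke; the file names, the accuracy metric, and the 0.01 tolerance are assumptions for illustration.

```python
# Validation gate sketch: compare a candidate model against the production
# model on a held-out set and fail the build (non-zero exit) on a regression.
import sys
import joblib
import pandas as pd
from sklearn.metrics import accuracy_score

def evaluate(model_path: str, features: pd.DataFrame, labels: pd.Series) -> float:
    model = joblib.load(model_path)
    return accuracy_score(labels, model.predict(features))

if __name__ == "__main__":
    holdout = pd.read_csv("holdout.csv")  # assumed held-out evaluation set
    X, y = holdout.drop(columns=["label"]), holdout["label"]

    candidate_score = evaluate("candidate_model.joblib", X, y)
    production_score = evaluate("production_model.joblib", X, y)
    print(f"candidate={candidate_score:.4f} production={production_score:.4f}")

    # Exit non-zero so the pipeline blocks deployment on a meaningful regression
    if candidate_score < production_score - 0.01:
        sys.exit("Candidate model underperforms production; blocking deployment.")
```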
Observability, Monitoring and Alerting
Once a model leaves the notebook stage of development and is ready to receive traffic to generate predictions, it ceases to exist as an experiment and begins to function as a full-fledged production service. And as a production service, it should receive the same monitoring, alerting, and observability that you would apply to any API or customer-facing application service.
Implementing proper observability and monitoring for your machine learning inference services will ensure that your services run with limited downtime and will equip your developers with the tools needed to efficiently debug and remediate a bug or service outage. In addition, proper observability and monitoring are vital to understanding the utilization of your resources, ensuring that you do not overprovision your models and waste money on excess compute.
Logging
Like any service, comprehensive logging plays a vital role in ensuring the reliability, performance monitoring, and debugging of deployed machine learning models. Logging allows you to capture valuable information about the inference process, such as input data, model predictions, inference latency, errors, and system metrics. Through logging, organizations gain visibility into the behavior of deployed models in real-world scenarios, allowing them to monitor model performance, identify potential issues, and diagnose root causes of errors or discrepancies.
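A minimal sketch of structured, per-request inference logging might look like the following; the field names and JSON-lines format are illustrative choices, and in production these records would typically be shipped to a centralized log store.

```python
# Structured inference logging sketch: capture the input, prediction, and
# latency for every request as JSON lines so they can be searched and audited.
import json
import logging
import time
import uuid

logger = logging.getLogger("inference")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def predict_with_logging(model, features: list[float]) -> float:
    request_id = str(uuid.uuid4())
    start = time.perf_counter()
    prediction = float(model.predict([features])[0])
    latency_ms = (time.perf_counter() - start) * 1000
    # One structured record per request: easy to ship to a log store and query later
    logger.info(json.dumps({
        "request_id": request_id,
        "features": features,
        "prediction": prediction,
        "latency_ms": round(latency_ms, 2),
    }))
    return prediction
```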
In addition, many industries such as healthcare or financial services require that logs are stored and made easily accessible in order to remain compliant with regulatory requirements and auditing standards. This can require not only storing the logs but also providing a transparent audit trail of model predictions and the factors influencing decision-making processes. This type of regulation and scrutiny is only likely to increase as AI/ML industries continue to undergo new oversight and legislation from lawmakers. Lastly, logged data can be used for offline analysis, model retraining, and continuous improvement, providing organizations with another mechanism to iteratively enhance the accuracy and effectiveness of their deployed models.
Monitoring and Observability
Once a service is deployed and released into production, there are a number of factors outside of the application code itself that can lead to a service failing or experiencing degraded performance. Traffic could drastically spike, an anomalously large request could be received - eating up all the machine's available resources - or an external service within the model’s dependencies could suddenly become unavailable. All of these scenarios are common occurrences within machine learning services, and it’s important to anticipate these challenges and have the tools in place to be able to identify them.
Monitoring the resource utilization and traffic metrics of your inference services will provide you with this insight. Monitoring involves emitting service metrics relating to resource utilization (CPU, memory, GPU) and traffic indicators (latency, throughput, error rates), ingesting these metrics in real-time, and displaying them visually so they can be interpreted and analyzed.Â
Popular open source frameworks such as Prometheus and Grafana provide powerful interfaces that can easily integrate into your applications, automatically track all of these resource and service metrics, and visualize them in a meaningful way. Machine learning models are computationally demanding, frequently rely heavily on GPUs, and commonly max out resources. It’s important to understand the swings in your application’s resource utilization so you can make better decisions around node allocation, enable auto-scaling policies, and ultimately keep your applications from failing.
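For example, with the prometheus_client Python library you can expose request, error, and latency metrics for Prometheus to scrape and Grafana to chart. The metric names and port below are illustrative.

```python
# Sketch of exposing inference metrics with prometheus_client: request counts,
# error counts, and a latency histogram served on a /metrics endpoint.
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("inference_requests_total", "Total inference requests")
ERRORS = Counter("inference_errors_total", "Total failed inference requests")
LATENCY = Histogram("inference_latency_seconds", "Inference latency in seconds")

def instrumented_predict(model, features):
    REQUESTS.inc()
    start = time.perf_counter()
    try:
        return model.predict([features])[0]
    except Exception:
        ERRORS.inc()
        raise
    finally:
        LATENCY.observe(time.perf_counter() - start)

# Expose a /metrics endpoint on port 9100 for Prometheus to scrape
start_http_server(9100)
```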
Alerting
Now that you have your service logs stored and accessible and your application metrics visualized, you don’t want to be sitting around staring at a bunch of log tails or slowly changing dashboards all day, do you? No, that would be ridiculous. Our applications are going to run fine 99% of the time - *hopefully* 🙏 - we just want to be notified when things start to go bad. Alerting services do exactly this. There are a number of open source providers (Prometheus Alertmanager, OpenAPM), managed services (New Relic, Datadog, Splunk), and cloud provider solutions (Amazon CloudWatch alarms, Google Cloud Observability) that easily integrate with your metric provider framework and notify you when something goes wrong.
These alerting services allow you to define threshold policies, such as CPU exceeding 90% for 30 minutes or throughput falling below 5 RPM for more than 1 hour. Once these threshold policies are triggered, you can integrate with a notification service like Slack or PagerDuty to receive a message or call informing an engineer that they need to investigate a service. Alerting services are crucial tools for any engineering team looking to scale their support offerings and provide the highest level of uptime and availability for their services.
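Managed alerting tools implement these policies declaratively, but conceptually a threshold policy boils down to evaluating a metric window against a limit and notifying a channel when it is breached. The sketch below illustrates that idea with a hypothetical Slack webhook; it is not how any particular alerting product is configured.

```python
# Illustrative threshold-policy check: alert if CPU stays above a limit for
# the full sample window. The webhook URL and metric source are placeholders.
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # hypothetical webhook

def check_cpu_policy(cpu_samples: list[float], threshold: float = 90.0) -> None:
    """Alert if every sample in the window (e.g., the last 30 minutes) exceeds the threshold."""
    if cpu_samples and all(sample >= threshold for sample in cpu_samples):
        requests.post(SLACK_WEBHOOK_URL, json={
            "text": f"CPU above {threshold}% for the full window; please investigate the inference service.",
        }, timeout=5)

# Example: samples collected once per minute over the last 30 minutes
check_cpu_policy([92.5, 94.1, 91.0] * 10)
```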
How Does Qwak Help with Deployment?
Qwak is an end-to-end platform that handles all aspects of machine learning development - training, experiment tracking, inference deployment, monitoring and alerting, observability, automation, feature store, and vector store. For inference deployment, Qwak supports Batch, Real-Time, and Streaming use cases, making it extremely easy to migrate from one inference type to another. Once your inference service is deployed, Qwak automatically provides observability dashboards and service logs so you can easily monitor your service performance, make adjustments to cluster resources, or debug an issue.
Built-in alerting and notifications allow you to define custom thresholds and receive alerts via email, Slack, or PagerDuty when it’s time to investigate a service. Automations, Qwak’s custom CI/CD tooling, allows you to define scheduled training and inference deployments that automatically execute once changes to your code have been made. Qwak natively supports multiple model variations as well as shadow deployments, so you can intelligently A/B test and select the best variation of your model.
In addition, Qwak also maintains an analytics or inference lake that stores the inputs sent to your model as well as the predictions that the model makes. This analytics lake provides invaluable model data that enables you to compare model performance, conduct complex feature and model drift analysis, run dynamic A/B testing strategies, and intelligently influence your CI/CD pipelines.