Top ML Serving Tools

Deploying an ML model in production is a complicated process. ML serving tools help serve models in a scalable way with optimized serving architecture.
Ran Romano
Co-founder & CPO at Qwak
January 18, 2023

Modern businesses rely on machine learning models more than ever to run their operations. Machine learning helps organizations reduce customer acquisition costs through focused marketing, serve customers better by anticipating their needs and providing personalized recommendations, and cut operational costs by optimizing logistics, demand planning, and more. It also automates tasks that were once considered possible only for human beings. Behind these benefits lies a complex sequence of activities and an even more complicated architecture that has to work all the time. Building the model is itself a difficult task, but it is just the tip of the iceberg when it comes to deriving value from machine learning.

After a good enough model is developed, an organization goes through numerous activities to deploy it into production and operationalize it: finalizing the deployment architecture, handling continuous training, tracking drift, defining tests to decide on model improvement and replacement, and finally continuous monitoring. Development teams build large systems to streamline this process of building, deploying, and continuously improving models. Such systems are called ML serving tools or MLOps tools.

An alternative to building custom MLOps systems is to use a third-party MLOps framework. If you are looking for such a system, take a look at Qwak. It simplifies the productionization of machine learning models at scale. Qwak’s Feature Store and ML Platform empower data science and ML engineering teams to deliver ML models to production at scale. This article covers the top MLOps tools and how they compare with each other.

Need For ML Serving Frameworks

Taking a machine learning model to production involves numerous complex steps and logical systems. The first step of building any ML component is finalizing the model architecture, which involves trying out multiple architectures and shortlisting a set that fits the purpose. Developers then deploy the shortlisted models to production and execute A/B tests to assess their accuracy. In most cases, there is already a rule-based decision-making system in place, and the model is a replacement for it. The deployed models are then evaluated against each other and against the rule-based system to decide the winner or winners.

Defining the architecture for model deployment is also a complex process. Machine learning models are essentially thousands of equations packed into a large data file. Running inference involves feeding input values into these equations and passing them through thousands of layers, which makes inference a very resource-hungry process. Models are deployed on GPU machines to take advantage of the parallelism they offer. The challenge is to finalize a deployment architecture that optimizes both cost and performance.

Choosing an architecture to deploy the model involves answering many tough questions. What is the SLA for response time? Can batch-based inference work? Which model frameworks should be supported? What are the scaling requirements? Will the inference load peak at certain times or stay relatively constant throughout the day? Once these questions are answered, development of the inference system can start. But the answers do not stay constant: things change, and the inference system goes through numerous iterations of architecture refinement.

The process does not end with deploying the model. New models are continuously trained and evaluated against the deployed model to decide on a version change. This is an iterative, never-ending process. The model also goes through numerous tests to identify data drift, which happens when the live data gradually diverges from the data on which the model was trained. In the meantime, developers keep improving the model by refining existing features or adding new ones, which leads to further changes in deployment.

Managing all the activities mentioned above manually is not practical, so organizations devise frameworks to automate the process as much as possible. This is where MLOps framework products can make a difference. These MLOps frameworks, or MLOps tools, can automate most model management activities and provide reports that developers can use to make decisions. They can also keep track of data versions and model versions. MLOps frameworks and serving tools are built by companies with years of experience in the machine learning space, so they can save a lot of learning time for organizations just getting started. Relying on such products is a better alternative than developing custom tools.

Must-Have Features In An MLOps Tool

An ideal MLOps tool manages all the activities in the lifecycle of a machine learning system. At a high level, it must have the following features.

Data Versioning

A model’s accuracy is mainly dependent on the data on which it is trained. Two models with the same architecture trained on slightly different data may give different results. Models are often used in critical applications, and even the smallest unexplainable change in inference output on the test data can lead to a model being rejected. Hence, keeping track of the data on which a model is trained is an important step in the model lifecycle. Throughout the lifecycle of an ML system, models get retrained frequently as new tagged data from the live system becomes available. Once a new model is trained, developers compare its test results with the previous model’s results, and if there is significant confidence that the output has improved, it is pushed to production. Otherwise, the model is retained in the repository with a tagged data version. MLOps tools must provide a facility for storing this version information.
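
As a minimal illustration of the idea (not tied to any particular tool), a dataset version can be pinned by hashing its contents and recording the hash alongside the model’s metadata; the paths and names below are hypothetical.

```python
import hashlib
import json
from pathlib import Path

def dataset_version(data_dir: str) -> str:
    """Compute a content hash over every file in a dataset directory."""
    digest = hashlib.sha256()
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file():
            digest.update(path.name.encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()[:12]

# Record the data version next to the model artifact so that any retrained
# model can be traced back to the exact data it was trained on.
metadata = {
    "model_version": "churn-classifier-v7",            # hypothetical model name
    "data_version": dataset_version("data/training"),  # hypothetical data path
}
Path("model_metadata.json").write_text(json.dumps(metadata, indent=2))
```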

Model Versioning

New training data is not the only way to develop a new version of a model. Even with the same base architecture, new versions can be developed by using different hyperparameters. For example, changing the activation function can lead to an entirely different model with different results on the test data, and a slight change in the learning rate or dropout ratio can change results for better or worse. Of course, one can also create a new model by using an entirely different architecture. The point is that an active machine learning team keeps changing the model and releases new variations quite frequently. A good MLOps tool must be able to keep track of them and upgrade, or even revert, the model in production seamlessly.

Support For Multiple Inference Strategies

Choosing an inference strategy is primarily driven by cost and performance requirements. Some use cases need real-time inference with no possibility of waiting to form a reasonable batch. Under high traffic, even real-time inference may have to be executed with optimal batch sizes to achieve the highest throughput. Then there are cases where inference can be scheduled for specific time slots. An ML serving tool must support different inference strategies and allow developers to choose based on their requirements.

Continuous Deployment

Continuous deployment of models is very different from the build and deployment of code assets. As mentioned earlier, a model in its most basic form is a data file containing the weight parameters of mathematical equations. A typical model deployment involves retrieving the model file and pushing it to a location where the inference engine can access it, and ends with changing the inference engine’s configuration to point to the new file. Continuous deployment and integration are must-have features for an ideal MLOps tool.

Tracking Model Drift And Model Observability

Model drift is the result of the live data distribution slowly diverging from the data on which the model was trained. It can be detected by continuously testing the model on production data and comparing the results against the predefined test-data results. The production results will typically be worse than the test results, but the variation must stay within statistical limits. There are many statistical measures to determine whether the variation is increasing and getting out of control; the Kullback-Leibler and Jensen-Shannon divergences are popular choices for identifying drift.
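
As a rough sketch of how such a check might look in practice, the snippet below compares a training-time feature distribution with a live one using SciPy; the feature data and alert threshold are purely illustrative.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import entropy

def to_histogram(values: np.ndarray, bins: np.ndarray) -> np.ndarray:
    """Turn raw feature values into a normalized histogram."""
    counts, _ = np.histogram(values, bins=bins)
    return counts / counts.sum()

# Illustrative stand-ins for a feature as seen at training time and in production.
train_feature = np.random.normal(0.0, 1.0, 10_000)
live_feature = np.random.normal(0.3, 1.2, 10_000)

bins = np.histogram_bin_edges(np.concatenate([train_feature, live_feature]), bins=30)
p, q = to_histogram(train_feature, bins), to_histogram(live_feature, bins)

kl = entropy(p + 1e-9, q + 1e-9)   # Kullback-Leibler divergence
js = jensenshannon(p, q) ** 2      # Jensen-Shannon divergence

# The threshold would normally be calibrated from historical baselines.
if js > 0.05:
    print(f"Possible data drift: KL={kl:.4f}, JS={js:.4f}")
```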

Support For A/B Testing

Developers often come up with multiple models that show good enough accuracy on test data. Slight variations in test accuracy are not a good indicator that one model is better than another, because the models may behave differently when exposed to production data. Hence, machine learning teams often deploy multiple models to production and use actual customer-behavior-based indicators to decide on the best one. This is called A/B testing. A good ML serving tool must be able to conduct A/B tests and adjust routing ratios based on developer requirements. It should also report metrics from the A/B test and streamline the decision-making process.
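
A minimal sketch of weighted traffic routing between two model variants is shown below; the routing weights, the stand-in predict function, and the logging are hypothetical placeholders for real endpoints and metric pipelines.

```python
import random

# Hypothetical routing table: model variant -> share of traffic it receives.
ROUTING_WEIGHTS = {"model_a": 0.9, "model_b": 0.1}

def pick_variant() -> str:
    """Choose a model variant for an incoming request based on traffic weights."""
    variants, weights = zip(*ROUTING_WEIGHTS.items())
    return random.choices(variants, weights=weights, k=1)[0]

def predict_with(variant: str, features: list) -> float:
    """Stand-in for calling the chosen variant's real inference endpoint."""
    return sum(features) if variant == "model_a" else max(features)

def handle_request(features: list) -> float:
    variant = pick_variant()
    prediction = predict_with(variant, features)
    # Record which variant served the request so business metrics can later be
    # attributed to the right model when evaluating the test.
    print(f"variant={variant} prediction={prediction:.3f}")
    return prediction

handle_request([0.2, 0.5, 0.1])
```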

Hyperparameter Tuning

Getting the best out of a finalized model architecture often involves optimizing hyperparameters. MLOps tools enable hyperparameter optimization through iterative validation runs. Keeping track of model accuracy across hyperparameter combinations is a must-have feature for a model serving framework.
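
The bookkeeping involved is simple to sketch: the snippet below runs a hypothetical grid of hyperparameter combinations and tracks the accuracy of each, which is exactly the record-keeping an MLOps tool automates and visualizes.

```python
import itertools

# Hypothetical hyperparameter grid for a fixed model architecture.
grid = {
    "learning_rate": [0.01, 0.001],
    "dropout": [0.1, 0.3],
    "batch_size": [32, 64],
}

def train_and_evaluate(params: dict) -> float:
    """Stand-in for a real training run that returns validation accuracy."""
    return 0.9 - params["dropout"] * 0.1 + params["learning_rate"]

# Track accuracy per combination -- the record-keeping an MLOps tool automates.
results = []
for values in itertools.product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    results.append((train_and_evaluate(params), params))

best_accuracy, best_params = max(results, key=lambda r: r[0])
print(f"Best accuracy {best_accuracy:.3f} with {best_params}")
```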

Monitoring

Model monitoring involves keeping track of a model's accuracy to detect sudden changes in performance. Performance can degrade drastically due to many factors, such as feature pre-processing bugs or concept drift. A good model serving tool must continuously monitor this and raise alerts when models degrade beyond threshold values.

Scaling

Models face varying amounts of load at different times of the day. Since serving models is resource-intensive, it is not good practice to host them at the highest configuration required for peak load at all times. There are times of day when load is minimal and organizations can get away with lower configurations. A good model serving tool must be able to scale dynamically based on load patterns to optimize cost.

Feature Store

A feature store is a place where the features required by models are stored in an easily accessible way. Machine learning teams play around with features all the time and experiment with different combinations. A central, shareable feature store helps organizations maintain feature quality and ensures that all teams work with the same data. Synchronizing features between batch and real-time inference modes is another desirable capability. An ideal ML serving tool contains a dependable built-in feature store.

Top MLOps Tools

MLOps tooling is a thriving area, and many great products with all of the features mentioned above exist in this space. We will now look at some of the top ones and explore how they compare based on those features.

AWS SageMaker

AWS SageMaker is a fully managed machine learning framework from Amazon Web Services. It helps developers train, test, and deploy machine learning models with fully managed infrastructure and workflows. SageMaker supports data versioning and model versioning along with a comprehensive model monitoring facility. It can be used to implement A/B testing and supports hyperparameter tuning. SageMaker has an automated model-tuning feature that takes control from developers and provides the best hyperparameters after a series of training runs.

SageMaker supports multiple inference strategies, including real-time and batch, and can be configured to scale automatically. It also comes with a fully managed feature store implementation that can store feature data in different formats.

SageMaker provides a feature called Autopilot that reduces developer effort in creating models. With Autopilot, developers simply provide input data and labeled output data. SageMaker executes training runs with architectures suited to the problem, produces the best models, and sorts them in a leaderboard. Developers can then choose the best model and deploy it to production with a single click.

SageMaker Canvas enables developers to create unified datasets and chain models together to solve complex problems using a no-code approach. It provides a visual interface where developers can point and click to get the desired results.

SageMaker Data Wrangler is another notable feature that reduces developer effort. It helps developers prepare and pre-process data for training and simplifies feature engineering through an intuitive visual interface. Developers can use SQL to select, transform, and import data. Data Wrangler also provides a data quality and insights report that flags anomalies, duplicate data, and similar issues, and it supports more than 300 built-in data transformations that can be used without code.

SageMaker deployment provides many options to strike a balance between cost and performance. It supports serverless deployment with zero maintenance, and real-time inference with both synchronous and asynchronous responses. Models can be configured as single-model or multi-model endpoints, and multi-container endpoints are also supported. Multiple containers are beneficial when the models use different frameworks.
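
As an illustration, a real-time endpoint can be created with the SageMaker Python SDK roughly as follows; the S3 artifact path, IAM role, and framework versions are placeholders, and instance type and count are the main levers for the cost/performance trade-off.

```python
# Sketch only: the artifact path, IAM role, and versions are placeholders.
from sagemaker.pytorch import PyTorchModel

model = PyTorchModel(
    model_data="s3://my-bucket/models/churn/model.tar.gz",  # hypothetical artifact
    role="arn:aws:iam::123456789012:role/SageMakerRole",    # hypothetical role
    entry_point="inference.py",
    framework_version="1.12",
    py_version="py38",
)

# Deploying creates a real-time endpoint; instance type and count are the main
# knobs for balancing cost against latency and throughput.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
)

print(predictor.predict([[0.2, 0.5, 0.1]]))
```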

SageMaker can also implement serial inference pipelines where the output of one model is fed to another model. Such multi-step architectures are often used in computer vision and natural language processing problems. SageMaker's inference recommender feature recommends the optimal inference configuration based on the problem definition.

SageMaker supports most machine learning frameworks, including TensorFlow, PyTorch, MXNet, Hugging Face, and scikit-learn. It also supports serving frameworks like TensorFlow Serving, TorchServe, NVIDIA Triton, and AWS Multi Model Server. A recent notable addition is support for geospatial data.

TF Serving

TF Serving is a high-performance model serving tool bundled as part of Google's TensorFlow framework. Even though it primarily supports TensorFlow-based models, its architecture is designed so that other frameworks can also be used. TF Serving is not a fully managed model serving tool like Amazon SageMaker: while it provides all the features required for serving models, developers need to tinker a bit to configure and use it properly. TF Serving is suited for development teams that do not want to rely on third-party model serving tools and prefer to build their own deployment architecture.

TF Serving primarily deals with the inference part and provides versioned access to models through a look-up table. It can serve multiple models simultaneously, and it can serve different versions of the same model at the same time. This is helpful when developers create models using the same architecture but with slightly different training data or hyperparameters. TF Serving can expose different kinds of serving endpoints, such as gRPC and HTTP. It includes a built-in scheduler that groups requests together into automatic batches. This feature is very important when implementing high-performance inference engines exposed to very high traffic, as it helps achieve high throughput in such situations.
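
Querying a running TF Serving container over its REST API is straightforward; the sketch below assumes a server is already running locally on the default REST port 8501 and serving the standard half_plus_two example model.

```python
import json
import requests

# Assumes a TF Serving container is already running locally, e.g. the standard
# half_plus_two example model exposed on the default REST port 8501.
url = "http://localhost:8501/v1/models/half_plus_two:predict"
payload = {"instances": [[1.0], [2.0], [5.0]]}

response = requests.post(url, data=json.dumps(payload), timeout=5)
response.raise_for_status()
print(response.json()["predictions"])

# A specific version can be targeted explicitly, e.g.:
# http://localhost:8501/v1/models/half_plus_two/versions/2:predict
```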

TF Serving is not a fully managed solution and requires other components to work; the recommended option is to run it with Docker containers. TF Serving uses the concept of servables to abstract the model serving pipeline. A servable can be a lookup or an inference, and it can support different kinds of tasks such as embedding lookups and streaming results. A servable is very flexible in size and granularity: it can serve a small shard of a large look-up table or a whole group of models in one instance. A servable stream is a sorted sequence of different versions of a model.

Servables can be combined into a composite servable. One can also enclose a fraction of a model in a servable and combine several of them to create the full model. Servables do not manage their own lifecycle; TF Serving uses the concept of managers to manage the lifecycle of servables, while loaders provide a standard set of APIs and common infrastructure for initializing or terminating a model. Managers decide when to load or unload a model: they control which version of the model is loaded and how long an old model remains loaded after a new one arrives.

TF Serving allows close control of batching configurations, letting developers tune cost and performance in a granular manner. Developers can use parameters such as the batch timeout, the number of batch threads, and the maximum number of enqueued batches to get the desired performance.

Google Vertex AI

Vertex AI is a fully managed MLOps tool from Google Cloud. It is a unified framework with a data platform and an AI platform bundled into one. Developers can write custom training code to develop models or take advantage of the AutoML feature to let Vertex AI control the training. Irrespective of whether models were trained using custom code or AutoML, they can be deployed to the same endpoints, so one can start with an AutoML-trained model and refine it after the launch of the first version. Once developers gain confidence in their own models, they can swap the deployed model with a click of a button at the same URL.
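
For teams that prefer code over the console, the Vertex AI Python SDK (google-cloud-aiplatform) can upload and deploy a model along these lines; the project, bucket, and serving container image below are illustrative placeholders.

```python
# Sketch only: project, region, bucket, and container image are placeholders.
from google.cloud import aiplatform

aiplatform.init(project="my-gcp-project", location="us-central1")

model = aiplatform.Model.upload(
    display_name="churn-classifier",
    artifact_uri="gs://my-bucket/models/churn/",
    serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest"
    ),
)

# Deploying creates (or reuses) an endpoint; traffic can later be split between
# model versions behind the same endpoint.
endpoint = model.deploy(machine_type="n1-standard-4")
print(endpoint.predict(instances=[[0.2, 0.5, 0.1]]))
```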

Vertex AI supports all the popular open-source machine learning frameworks, such as TensorFlow, PyTorch, and scikit-learn. Developers can also bring their own custom containers and serve them through Vertex AI; custom containers spare developers software-compatibility concerns when using uncommon frameworks. Vertex AI comes with built-in support for monitoring, including automated alerts for data drift, concept drift, or any other problems that can degrade model performance. Vertex AI Workbench is a single place where data scientists can do all their ML work, including experimentation, testing, deployment, and monitoring. It is built on top of Jupyter notebooks with integration for Google's infrastructure and user access controls.

Vertex AI provides a data labeling service where developers can get help from human labelers to accumulate training data. It has a built-in vector database that can store embeddings in a searchable manner; vector databases are used to find similarities in applications such as recommendation engines. Vertex ML Metadata is Google's solution for tracking model lineage and versions. CI/CD is provided by Vertex AI Pipelines, which are built on top of Kubeflow Pipelines and combine Google's managed infrastructure with the flexibility of Kubernetes to provide a seamless scaling experience.

The Explainable AI module of Vertex AI brings observability as well as reasoning to the model implementation.

Vertex AI provides many options for hyperparameter tuning. The Vertex AI training module comes with built-in algorithms to help developers get the most out of their custom models. The TensorBoard feature provides visualizations to aid ML experimentation with different hyperparameter combinations. Vertex AI Vizier provides a black-box hyperparameter tuning service that uses different search algorithms, such as grid search and linear combination search, to arrive at optimal parameters, and it works even when there is no known objective function.

Even after using all these features, if you are still unable to get the desired results, give Vertex AI's Neural Architecture Search a try. It can search for new model architectures based on your application needs.

MLflow

MLflow is an open-source platform for managing the entire machine learning lifecycle. It can manage experiments, deploy models, and host a central model registry. It is library agnostic and supports all ML libraries and frameworks, can be deployed in a cloud of your choice, and can handle big data problems using Apache Spark as the processing backend. MLflow is a great choice when an organization does not want to get locked into a cloud provider like AWS or GCP and wants an abstraction that can be deployed in any cloud. MLflow has four components: Tracking, Projects, Models, and Model Registry.

MLflow Tracking provides a set of APIs and a user interface to log hyperparameters, model versions, data versions, and result metrics while executing experiments, along with a dashboard to visualize the results. APIs are available for Java, Python, and R, and a REST API is available if you are working in another language. MLflow abstracts experiments using the concept of runs, capturing the code version, start time, end time, source, parameters, metrics, and artifact details for each run. The information can be stored in an SQLAlchemy-compatible database or on a remote tracking server.
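
A typical tracking call looks roughly like the snippet below; the parameter values and artifact file are hypothetical, and runs are written to the local ./mlruns directory unless a remote tracking server is configured.

```python
import mlflow

# Log one experiment run; results go to ./mlruns by default, or to a remote
# tracking server if one is configured via MLFLOW_TRACKING_URI.
with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("data_version", "a3f9c21")   # hypothetical dataset hash
    mlflow.log_metric("val_accuracy", 0.87)
    mlflow.log_artifact("model_metadata.json")    # any local file as an artifact
```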

An MLflow Project is a logical grouping of all the code required to run a machine learning module. Projects are envisioned as reusable assets that can be chained together to create a complete flow. MLflow Projects include a set of APIs and command-line tools to manage projects. Each project is defined by a name, entry points, and an execution environment, which can be a Docker container, a Python virtualenv, or a conda environment. To tune hyperparameters, one can launch MLflow runs in parallel with different hyperparameter combinations.

MLflow Models provide a standard format for bundling machine learning models so they can be used by different kinds of downstream systems, whether for real-time inference behind APIs or batch inference using Spark. MLflow integrates with Databricks and Kubernetes to enable the batch inference mode. It uses the concept of flavors to integrate models built with different libraries, providing a set of built-in flavors and the flexibility for users to implement custom ones. The MLflow Model Registry is a centralized model store that provides model versions, controlled stage transitions to production, and model lineage. The registry uses annotations to mark the top-level model and supports Markdown for adding documentation about models. It provides a set of APIs and a user interface to manage models, and the APIs can be used by CI/CD pipelines to automatically register models after training runs.
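
Registering a logged model and promoting it through stages can be scripted as sketched below, which is the kind of step a CI/CD pipeline would automate; the run ID and model name are placeholders.

```python
import mlflow
from mlflow.tracking import MlflowClient

# Register a model logged in an earlier run (the run ID is a placeholder)...
result = mlflow.register_model("runs:/0a1b2c3d/model", "ChurnClassifier")

# ...and promote that version through the registry stages.
client = MlflowClient()
client.transition_model_version_stage(
    name="ChurnClassifier",
    version=result.version,
    stage="Production",
)
```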

BentoML

BentoML is an open-source platform that eases the work of data scientists and streamlines the process of moving models to production. It can be deployed on-premise or in a cloud provider of your choice, and BentoML also offers a cloud-based service for organizations that want a fully managed experience. BentoML supports parallel inference for maximum throughput, can create batches dynamically to accommodate high-traffic scenarios, and can run models on accelerated hardware. It supports all the common machine learning frameworks and provides a standardized model packaging mechanism, along with a built-in model store called the Yatai model registry.

Bento packages are deployed using Docker containers, either through the Yatai user interface or through the APIs provided by BentoML.

BentoML focuses specifically on production workloads and does not offer much for experimentation and the other lifecycle steps of model development. It provides a set of APIs and command-line tools to manage the model registry and deployment. Its packaging format is a file archive containing source code, models, datasets, and dependency configurations. Bento archives can be pushed to the Yatai model registry through APIs or the user interface, and the files can be stored in a third-party storage service such as AWS S3, GCS, or a MinIO instance.

Models are served using the concept of runners. BentoML comes with a set of built-in runners for popular frameworks and provides the flexibility to implement custom runners. Runners can make use of a GPU when available and automatically set performance parameters such as the number of threads and workers. Runners are capable of running multiple models at the same time, in either sequential or parallel execution mode: sequential runs are used for multi-step model implementations, while parallel runs are used when results from two models need to be aggregated.
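
A minimal service definition following the BentoML 1.x API might look like the sketch below; it assumes a scikit-learn model was previously saved to the local model store (for example with bentoml.sklearn.save_model), and the model and service names are illustrative.

```python
import bentoml
from bentoml.io import NumpyNdarray

# Assumes a scikit-learn model was previously saved to the local model store,
# e.g. with bentoml.sklearn.save_model("iris_clf", model).
runner = bentoml.sklearn.get("iris_clf:latest").to_runner()

svc = bentoml.Service("iris_classifier", runners=[runner])

@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
def classify(input_array):
    # The runner executes the model and can batch concurrent requests.
    return runner.predict.run(input_array)

# Served locally with: bentoml serve service:svc
```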

BentoML provides options for adding tracing to model serving APIs. Tracing correlates microservice calls to provide debugging information in complex architectures with nested service calls. It supports well-known tracing frameworks such as Zipkin, Jaeger, and OpenTelemetry.

Since BentoML focuses only on production serving, developers need to find alternatives for the other parts of the model development lifecycle. BentoML supports integration with numerous data platforms and ML platforms for this purpose, including Apache Airflow, a workflow management application, and Apache Flink, a stream processing framework.

It can also integrate with MLOps tools like MLflow and Arize AI.

Triton Inference Server

Triton is an open-source inference server from NVIDIA. It primarily focuses on production model ops and aims to standardize model packaging. It can be deployed on-premise or in a cloud of your choice. Triton supports all major machine learning frameworks, including TensorFlow, PyTorch, and scikit-learn, and it supports inference on NVIDIA GPUs, ARM CPUs, and AWS Inferentia. It has built-in support for audio and video streams. Triton supports dynamic batching, real-time synchronous inference, and batch-based inference, and it can execute multiple models on a single CPU, a GPU, or a multi-GPU cluster. In cases where a model is too large to fit in a single GPU, it supports multi-node inference.

Triton is built around a filesystem-based repository of models and provides a set of APIs for model management and deployment. Models are served over HTTP or gRPC. If your model framework is not one Triton supports, it provides options to implement your own backend through a C API. Triton's ace up the sleeve is its excellent inference scheduling capability, which allows it to execute multiple models in parallel and achieve high throughput. Triton uses the concept of instance groups to define the number of parallel executions supported by each model.
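
A client-side call using the tritonclient Python package might look like the sketch below; it assumes a Triton server is already running locally and that the hypothetical model's configuration declares the input and output tensor names used here.

```python
import numpy as np
import tritonclient.http as httpclient

# Assumes a Triton server runs locally and serves a hypothetical model named
# "simple_model" whose config declares an FP32 input "INPUT0" and output "OUTPUT0".
client = httpclient.InferenceServerClient(url="localhost:8000")

data = np.random.rand(1, 4).astype(np.float32)
inputs = [httpclient.InferInput("INPUT0", list(data.shape), "FP32")]
inputs[0].set_data_from_numpy(data)
outputs = [httpclient.InferRequestedOutput("OUTPUT0")]

result = client.infer(model_name="simple_model", inputs=inputs, outputs=outputs)
print(result.as_numpy("OUTPUT0"))
```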

Triton's ability to execute model inference in parallel complicates things for stateful models, which store information from a previous inference and use it as input for the next one. This is used when inference must be based on a sequence of steps taken by the subject. Triton uses the concept of a sequence batcher to handle this scenario, ensuring that all requests from a subject reach the same model instance even when parallel execution is on.

Triton comes with a performance analyzer and a model analyzer. These applications generate inference requests, analyze the performance of the engine, and measure latency and throughput over a time window to produce metrics. Triton can integrate with other MLOps frameworks to address the parts of the model lifecycle beyond production workloads: it uses Kubernetes for orchestration, can use Prometheus for model monitoring, and can integrate with MLflow through a plugin.

Triton mainly focuses on production deployment and does not try to enable a no-code approach like other MLOps tools; it encourages developers to use its APIs and command-line tools rather than a user interface. Triton's value proposition is its high throughput and close control of serving configurations.

Kubeflow Pipelines

Kubeflow Pipelines is a framework for building and deploying machine learning workflows based on Docker containers. While it can be used as an ML serving tool, it does not handle other parts of the machine learning lifecycle such as AutoML or automated hyperparameter optimization. It provides a user interface for tracking experiments and model inference runs, and developers can use Jupyter notebooks to interact with the system through the Kubeflow SDK components. The inference engine that comes as part of Kubeflow can execute multi-step model inference pipelines. Kubeflow's objective is to handle end-to-end orchestration of ML deployments and to facilitate component reuse and quick experiments.

Kubeflow defines a machine learning workflow as a pipeline: a collection of components and a runtime execution graph connecting them. Pipelines are defined in Python and can be visualized in the Kubeflow user interface. A component is an independent code snippet that performs a single task, such as data preprocessing, transformation, or training, and components are packaged as Docker containers. An experiment is an ad hoc configuration that facilitates a run of the pipeline. Kubeflow captures model metadata using Google's ml-metadata format.
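
A small pipeline definition, assuming the KFP v2 SDK, might look like the sketch below; the component logic and the GCS path are placeholders, and the compiled spec is what the Kubeflow Pipelines backend actually runs.

```python
from kfp import compiler, dsl

# Each component runs in its own container; the base image is an illustrative choice.
@dsl.component(base_image="python:3.9")
def preprocess(raw_path: str) -> str:
    # Placeholder preprocessing step.
    return raw_path + "/cleaned"

@dsl.component(base_image="python:3.9")
def train(data_path: str) -> str:
    # Placeholder training step that would normally produce a model artifact.
    return data_path + "/model"

@dsl.pipeline(name="demo-training-pipeline")
def training_pipeline(raw_path: str = "gs://my-bucket/raw"):
    cleaned = preprocess(raw_path=raw_path)
    train(data_path=cleaned.output)

# Compile the pipeline into a spec the Kubeflow Pipelines backend can run.
compiler.Compiler().compile(training_pipeline, "training_pipeline.yaml")
```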

Kubeflow can integrate with other applications to handle model-building lifecycle events and other production deployments. For example, it can integrate with Katib, a cloud-native project for automated machine learning, and with Prometheus for capturing metrics.

Qwak

Qwak is a fully managed MLOps tool that supports all the activities of the machine learning model development lifecycle. It helps you transform and store data and train, deploy, and monitor models. Qwak can track experiments and promote the best model among the results to production. It has a built-in feature store and also supports automated monitoring.

Conclusion

MLOps frameworks help formalize the steps required to take a machine learning model from experimentation to production. They keep track of experiments, log their metrics, help conduct A/B tests, and deploy machine learning models with high throughput. They also facilitate the monitoring of models to identify data drift, concept drift, or any problem that can degrade model output. This article covered seven top MLOps tools. Some of them are fully managed services that even provide a no-code approach to building machine learning models. Others are open source and can be deployed in any cloud of your choice, but they need integration with external applications to handle the full machine learning development lifecycle, for features such as automated machine learning, a feature store, or a vector database. Fully managed services like SageMaker and Vertex AI keep organizations locked to specific cloud providers. If you are looking for a more flexible MLOps tool with all the must-have features and more, consider Qwak.

Qwak simplifies the process of experimentation and deployment of machine learning models at scale. Qwak’s Feature Store and ML Platform help engineering teams to embrace agility in the true sense while developing machine learning pipelines. 

Chat with us to see the platform live and discover how we can help simplify your journey deploying AI in production.

Say goodbye to complex MLOps with Qwak