Today, we are excited to announce the release of a new Public Cloud product. To speed up and simplify your Machine Learning projects, we are introducing the ML Serving.
At OVHcloud, we use multiple Machine Learning models that help with decision making, ranging from fighting fraud to enhancing our infrastructure maintenance. In a recent blog post we presented a Machine Learning platform that speeds up model prototyping. One of the key features of the platform is its ability to directly expose prototypes through a reliable web API.
However, for specific use cases – or to explore state of the art ML components – practitioners cannot limit themselves to using this platform alone. These projects, therefore, face one of the main challenges of Machine Learning: evolving from the prototype stage to an actual production grade system, which is reliable and accessible for business applications. To bridge this gap, we developed the ML Serving – a tool for handling the industrialization process.
Taking advantage of standard Open Source formats – such as the TensorFlow SavedModel – the ML Serving enables users to easily deploy their models while benefiting from essential features including instrumentation, scalability and model versioning. In this article we introduce the tool's key components and features.
ML Serving Design
The ML Serving was designed to satisfy the following requirements:
- Ease of use
- Model versioning
- Deployment strategies
- Ease of access
- High availability
- Reversibility
- Security
To achieve these goals, we based our design on a microservice architecture relying on a Kubernetes infrastructure. The project is split into two main components:
- The Serving Runtime: handles the actual model inference.
- The Serving Hub: handles model orchestration and management.
Serving Runtime
The Serving Runtime component is stand-alone and aims to address ease of use and ease of access. Thus, its two main tasks are to load models and to expose them.
Ease of access
We found the simplest way to expose the model was through an HTTP API, as it can then be easily integrated into any business application. The Serving Runtime provides an API limited to what is essential – an endpoint to describe the model and another to evaluate it.
The default API inputs are formatted as JSON tensors. Below is an example of an input payload for the Titanic dataset when evaluating two passengers' chances of survival:
{
"fare" : [ [ 7 ], [ 9 ] ],
"sex" : [ [ "male" ], [ "female" ] ],
"embarked" : [ [ "Q"], ["S" ] ],
"pclass" : [ [ "3" ], [ "2" ] ],
"age" : [ [ 27 ], [ 38 ] ]
}
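As a quick illustration, this payload can be sent to a deployed model with any HTTP client. The following Python sketch uses the requests library; the model URL and endpoint path are placeholders, not the product's actual routes:

import requests

# Placeholder address: replace with the URL of your own deployed model
MODEL_URL = "https://my-model.example.com/eval"

payload = {
    "fare": [[7], [9]],
    "sex": [["male"], ["female"]],
    "embarked": [["Q"], ["S"]],
    "pclass": [["3"], ["2"]],
    "age": [[27], [38]],
}

# Post the JSON tensors to the evaluation endpoint and print the predictions
response = requests.post(MODEL_URL, json=payload, timeout=10)
response.raise_for_status()
print(response.json())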
Tensors are used to address all served models in a universal way. However, the representation may be difficult to grasp and can be simplified in some cases. When running inference for a single passenger, the JSON looks like this:
{
"fare" : 7,
"sex" : "male",
"embarked" : "Q",
"pclass" :"3",
"age" : 27
}
The ML Serving also supports common image formats – such as JPG and PNG.
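As a rough sketch, and assuming the deployed model accepts raw image bytes in the request body (the exact contract depends on the model's signature), sending an image for inference could look like this:

import requests

MODEL_URL = "https://my-image-model.example.com/eval"  # placeholder address

# Hypothetical call: send the raw JPG bytes and let the Serving Runtime decode them
with open("sample.jpg", "rb") as image_file:
    response = requests.post(
        MODEL_URL,
        data=image_file.read(),
        headers={"Content-Type": "image/jpeg"},
        timeout=10,
    )
print(response.json())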
Ease of use
Now that we have defined an API, we need to load the models and evaluate them. Models are loaded based on serialization standards, of which the following are supported:
- TensorFlow SavedModel
- HDF5
- ONNX
- PMML
Each of these formats is supported by a dedicated module wrapped in a common interface. If additional formats were to emerge, or were required, they could easily be added as new modules. By relying solely on standard formats, the system enables reversibility: ML practitioners can load and reuse their exported models with a variety of languages and libraries outside the Runtime.
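To give an idea of this modular design (this is an illustrative sketch, not the Runtime's actual code), the common interface and a per-format module could look like the following:

from abc import ABC, abstractmethod
from typing import Any, Dict

class ModelEvaluator(ABC):
    """Common interface shared by every serialization-format module."""

    @abstractmethod
    def load(self, path: str) -> None:
        """Load the exported model files from the given path."""

    @abstractmethod
    def describe(self) -> Dict[str, Any]:
        """Return the model signature (input and output tensors)."""

    @abstractmethod
    def evaluate(self, inputs: Dict[str, Any]) -> Dict[str, Any]:
        """Run inference on tensor inputs and return the outputs."""

class ONNXEvaluator(ModelEvaluator):
    """One module per supported format; a new format is just another module."""

    def load(self, path: str) -> None:
        import onnxruntime  # third-party runtime for the ONNX standard
        self.session = onnxruntime.InferenceSession(path)

    def describe(self) -> Dict[str, Any]:
        return {i.name: i.shape for i in self.session.get_inputs()}

    def evaluate(self, inputs: Dict[str, Any]) -> Dict[str, Any]:
        # inputs maps tensor names to numpy arrays
        outputs = self.session.run(None, inputs)
        return {o.name: value for o, value in zip(self.session.get_outputs(), outputs)}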
Flexibility
The Serving Runtime can be run directly from the exported model files, or from a more flexible entrypoint – a manifest. A manifest describes a sequence of several models or evaluators. Using a manifest, users can describe a sequence of evaluators whose outputs are fed directly to subsequent evaluators. In other words, practitioners can combine them into a single model deployment.
Common use cases for this kind of combination include preprocessing the data with custom evaluators before feeding the actual model, or tokenizing sentences before evaluating a neural network.
This part is still experimental and has proven challenging in terms of maintaining ease of use and reversibility. Still, it will prove essential in some use cases, e.g. tokenizing sentences before evaluating a BERT-based model.
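To make the idea more concrete, here is a purely hypothetical manifest (the field names are illustrative, not the actual schema) chaining a tokenizer with a BERT-based model in a single deployment:

{
  "name" : "sentiment-pipeline",
  "evaluators" : [
    { "id" : "tokenizer", "type" : "custom", "path" : "tokenizer/" },
    { "id" : "classifier", "type" : "onnx", "path" : "model/bert_sentiment.onnx" }
  ],
  "flow" : [ "tokenizer", "classifier" ]
}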
Serving Hub
The Serving Hub component aims to address all production grade requirements: versioning, security, deployment and high availability.
Its main tasks are packaging the Serving Runtime with the exported models into a Docker image and then deploying it over a Kubernetes cluster. Overall, the component relies heavily on Kubernetes features. For instance, the Hub does not require a database, as all information regarding models and deployments is stored as Kubernetes Custom Resources.
The Serving Hub exposes two APIs – an authentication API, which secures model access and management via JWT tokens, and a second one devoted to model management itself.
When using the model management API, deploying a model is as simple as providing a name and a path to the model files in the storage backend. The Serving Hub then triggers a workflow that builds a Docker image, packaging the files with an instance of the Serving Runtime, before pushing it to a registry. The image is stand-alone and can simply be run using Docker.
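As a hedged sketch (the exact routes and field names may differ from the actual API), such a deployment request boils down to something like this:

import requests

HUB_URL = "https://serving-hub.example.com"  # placeholder address
TOKEN = "my-jwt-token"                       # obtained from the authentication API

# Hypothetical deployment request: a model name and the path to its files
# in the storage backend are enough to trigger the image build and rollout.
deployment = {
    "name": "titanic-survival",
    "path": "models/titanic/v1/",
}
response = requests.post(
    f"{HUB_URL}/models",
    json=deployment,
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=10,
)
response.raise_for_status()
print(response.json())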
The Serving Hub provides versioning through the tagging of model images; it also handles deployments using built-in Kubernetes capabilities, sets up all the model instrumentation and provides automated model scaling. Whenever a model is overloaded, a new instance is spawned, pulling the image from the registry to withstand the extra workload. Once the workload decreases, the additional instances are removed so that the consumed resources match the actual workload.
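The Serving Hub sets this up for you, but for readers curious about the underlying mechanism, the behaviour is comparable to a standard Kubernetes HorizontalPodAutoscaler (the resource below is an illustration with made-up names, not something you need to write yourself):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: titanic-survival            # illustrative model deployment name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: titanic-survival
  minReplicas: 1                    # extra instances are removed when the load drops
  maxReplicas: 5                    # new instances are spawned when the model is overloaded
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80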
Preset Images
While some models are inherently business-specific, others can be shared. To this end, the Serving Hub also manages a list of pre-built models covering common use cases.
Currently, two models are available on OVHcloud: a sentiment analysis model for English and another for French. Both are based on the well-known BERT architecture (and its French counterpart, CamemBERT), powered by Hugging Face, and fine-tuned for sentiment analysis using the Stanford Sentiment Treebank (1).
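Calling a preset model follows the same HTTP contract as any other deployment. As a purely hypothetical example (the actual input name may differ), a sentiment request could be as simple as:

{
  "sentence" : [ "This new serving product looks really promising" ]
}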
More models will eventually expand the current list of preset images, so please contact us if you think a specific task would be a useful addition (like some of the models available on the OVHcloud AI MarketPlace).
Next Steps
The ML Serving currently provides the fundamental features of any production system. ML systems, however, have their own particularities and call for additional features to monitor the overall health and performance of the deployed models.
One feature we are currently pursuing is the ability to monitor potential concept drift within models. Detecting concept drift early on allows for preventive retraining or refining of a production model, reducing the risk of making decisions based on bad predictions.
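Concept drift can be tackled in many ways; as a simple, generic illustration (not necessarily the approach we will ship), one can compare a live window of predictions or features against a reference window with a two-sample statistical test:

import numpy as np
from scipy.stats import ks_2samp

def drift_alert(reference: np.ndarray, live: np.ndarray, threshold: float = 0.01) -> bool:
    """Flag drift when the live distribution differs significantly from the reference.

    A two-sample Kolmogorov-Smirnov test is used as a simple, model-agnostic proxy
    for distribution shift: a low p-value suggests the live data no longer looks
    like the data the model was trained and validated on.
    """
    _, p_value = ks_2samp(reference, live)
    return p_value < threshold

# Example: compare last month's prediction scores with today's
reference_scores = np.random.beta(2, 5, size=10_000)  # stand-in for historical scores
live_scores = np.random.beta(5, 2, size=1_000)        # stand-in for recent scores
print(drift_alert(reference_scores, live_scores))     # True: the distributions differ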
Another important feature for a production system is interpretability. While usually computationally costly, interpretability can be crucial, either for debugging purposes (such as understanding why the model behaves poorly on specific samples) or from a business perspective. Indeed, it is not always desirable to blindly trust an algorithm without understanding its limitations or decision process.
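As a concrete, generic example of such a technique (again, not necessarily what will ship in the ML Serving), permutation feature importance gives a rough picture of which inputs drive a model's predictions:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Train a throwaway model on a public dataset, purely for illustration
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature in turn and measure how much the score degrades:
# the bigger the drop, the more the model relies on that feature.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
top_features = sorted(zip(X.columns, result.importances_mean), key=lambda pair: -pair[1])
for name, importance in top_features[:5]:
    print(f"{name}: {importance:.3f}")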
OVHcloud ML Serving
The ML Serving is available as an OVHcloud product within the Public Cloud management interface. Come and deploy your models or try out our preset models, and remember to share your feedback with us on the OVHcloud AI Gitter.
References
(1) Socher, Richard; Perelygin, Alex; Wu, Jean; Chuang, Jason; Manning, Christopher D.; Ng, Andrew; Potts, Christopher (2013). Recursive Deep Models for Semantic Compositionality over a Sentiment Treebank. Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 1631–1642.