Machine Learning projects are becoming an increasingly important component of today’s search for more efficient and complex industrial processes. OVH Prescience is a Machine Learning platform that aims to ease the conception, deployment and serving of models in an industrial context. The system manages Machine Learning pipelines, from data ingestion to model monitoring. This involves the automation of data preprocessing, model selection, evaluation, and deployment within a scalable platform.
Prescience supports various kinds of problems, such as regression, classification, time series forecasting, and soon, anomaly detection. Problem resolution is handled through the use of both traditional ML models and neural networks.
Prescience is currently used at production scale to solve various challenges faced by OVH, and its alpha version is available to explore for free at OVH Labs. In this blog post, we will introduce Prescience, and walk you through a typical workflow along with its components. An in-depth presentation of the components will be available in future blog posts.
The inception of Prescience
At some point, all Machine Learning projects face the same challenge: how to bridge the gap between a prototype Machine Learning system, and its use in a production context. That was the cornerstone of the development of a Machine Learning platform within OVH.
Production notebooks
More often than not, data scientists design Machine Learning systems that include data processing and model selection within notebooks. If they’re successful, these notebooks are then adapted for production needs by data engineers or developers. This process is usually delicate. It is time-consuming, and must be repeated each time the model or the data processing requires an update. These issues lead to production models that, though ideal when delivered, might drift from the actual problem over time. In practice, it is common for models to never be used in a production capacity, in spite of their quality, just because the data pipeline is too complicated (or boring) to take out of the notebook. As a result, all the data scientists’ work goes to waste.
In light of this, the first problem Prescience needed to solve was how to provide a simple way to deploy and serve models, while allowing monitoring and efficient model management, including (but not limited to) model retraining, model evaluation or model querying through a serving REST API.
Enhanced prototyping
Once the gap between prototyping and production was bridged, the second objective was to shorten the prototyping phase of Machine Learning projects. The base observation is that data scientists’ skills are most crucial when applied to data preparation or feature engineering. Essentially, the data scientist’s job is to properly define the problem. This includes characterising the data, the actual target, and the correct metric for evaluation. Nonetheless, model selection is also a task handled by the data scientist, and one where this specialist adds far less value. Indeed, one of the classic ways of finding a good model and its parameters is still to brute-force all possible configurations within a given space. As a result, model selection can be quite painstaking and time-consuming.
Consequently, Prescience needed to provide data scientists with an efficient way to test and evaluate algorithms, which would allow them to focus on adding value to the data and problem definition. This was achieved by adding an optimisation component that, given a configuration space, evaluates and tests the configurations within it, regardless of whether they’ve been tweaked by the data scientist. The architecture being scalable, we can quickly test a significant number of possibilities in this way. The optimisation component also leverages techniques to try and outperform the brute-force approach, through the use of Bayesian optimisation. In addition, tested configurations for a given problem are preserved for later use, and to ease the start of the optimisation process.
Widening the possibilities
In a company such as OVH, a lot of concerns can be addressed with Machine Learning techniques. Unfortunately, it is not possible to assign a data scientist to each of these issues, especially if it has not been established whether the investment would be worthwhile. Even though our business specialists have not mastered all Machine Learning techniques, they have an extensive knowledge of the data. Building on this knowledge, they can provide us with a minimal definition of the problem at hand. Automating the previous steps (data preparation and model selection) enables specialists to swiftly evaluate the possible benefits of a Machine Learning approach. It is then possible to adopt a quick-win/quick-fail process for potential projects. If this is successful, we can bring a data scientist into the loop, if necessary.
Prescience also incorporates automated pipeline management, to adapt the raw data to be consumed by Machine Learning algorithms (i.e. preprocessing), then select a well-suited algorithm and its parameters (i.e. model selection), while retaining automatic deployment and monitoring.
Prescience architecture and Machine Learning workflows
Essentially, the Prescience platform is built upon open-source technologies, such as Kubernetes for operations, Spark for data processing, and scikit-learn, XGBoost, Spark MLlib and TensorFlow for Machine Learning libraries. Most of Prescience’s development involved linking these technologies together. In addition, all intermediate outputs of the system – such as pre-processed data, transformation steps, or models – are serialised using open-source technologies and standards. This prevents users from being tethered to Prescience, in case it ever becomes necessary to use another system.
User interaction with the Prescience platform is made possible through the following elements:
- user interface
- Python client
- REST API
Let’s take a look at a typical workflow, and give a brief description of the different components…
Data ingestion
The first step of a Machine Learning workflow is to ingest user data. We currently support three types of source, and will add more depending on usage:
- CSV, the industry standard
- Parquet, which is pretty cool (plus auto-documented and compressed)
- Time-Series, thanks to OVH Observability, powered by Warp10
The raw data provided by each of these sources is rarely usable as-is by Machine Learning algorithms, which generally expect numbers to work with. The first step of the workflow is therefore performed by the Parser component. For plain text formats such as CSV, the Parser detects column names and types. Parquet and Warp10 sources already include a schema, which makes this detection unnecessary. Once the data is typed, the Parser extracts statistics, in order to precisely characterise it. The resulting data, along with its statistics, is stored in our object storage backend – Public Cloud Storage, powered by OpenStack Swift.
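To make the Parser's role more concrete, here is a minimal sketch of type detection and statistics extraction for a CSV source, using only the Python standard library. It is not Prescience's actual Parser: the function names (`infer_type`, `parse_csv`), the three type labels, and the chosen statistics are assumptions made for illustration.

```python
import csv
import io
import statistics


def infer_type(values):
    """Return 'integer', 'float' or 'string' for a column of raw strings.

    Empty strings are treated as missing values and ignored.
    """
    non_empty = [v for v in values if v != ""]
    for name, cast in (("integer", int), ("float", float)):
        try:
            for v in non_empty:
                cast(v)
            return name
        except ValueError:
            continue
    return "string"


def parse_csv(text):
    """Detect column names and types, then compute per-column statistics."""
    rows = list(csv.DictReader(io.StringIO(text)))
    schema, stats = {}, {}
    for column in rows[0].keys():
        values = [row[column] for row in rows]
        schema[column] = infer_type(values)
        present = [v for v in values if v != ""]
        col_stats = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
        if schema[column] != "string":
            numbers = [float(v) for v in present]
            col_stats.update(
                min=min(numbers), max=max(numbers), mean=statistics.mean(numbers)
            )
        stats[column] = col_stats
    return schema, stats


# Illustrative data only: three rows, one missing income.
sample = "age,city,income\n34,Lille,2500.5\n41,Paris,\n29,Lille,3100.0\n"
schema, stats = parse_csv(sample)
```

Statistics such as the number of distinct values and the number of missing values are the kind of information a later preprocessing step can rely on to choose a transformation.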
Data transformation
Once the types are inferred and the statistics extracted, the data still usually needs to be processed before it’s Machine Learning-ready. This step is handled by the preprocessor. Relying on the computed statistics and problem type, it infers the best strategy to apply to the source. For instance, if a categorical feature has only a few distinct values, then one-hot encoding is performed. However, if it has a large number of different categories, then a better-suited processing type is selected, such as level/impact coding. After being inferred, the strategy is applied to the source, transforming it into a dataset, which will be the basis of the subsequent model selection step.
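The choice between one-hot encoding and impact coding can be sketched as follows. This is an illustration rather than the preprocessor's real logic: the cardinality threshold `MAX_ONE_HOT_CARDINALITY` is a hypothetical value, and the impact coding shown is the simplest unsmoothed variant (category mean of the target minus the global mean).

```python
from collections import defaultdict

# Hypothetical threshold: above this many distinct values, avoid one-hot encoding.
MAX_ONE_HOT_CARDINALITY = 10


def choose_strategy(distinct_count, threshold=MAX_ONE_HOT_CARDINALITY):
    """Pick an encoding from the 'distinct' statistic of a categorical column."""
    return "one_hot" if distinct_count <= threshold else "impact"


def one_hot(values):
    """Encode each value as a 0/1 vector, one position per sorted category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


def impact_code(values, targets):
    """Replace each category by (mean target within the category - global mean)."""
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for v, t in zip(values, targets):
        sums[v] += t
        counts[v] += 1
    return [sums[v] / counts[v] - global_mean for v in values]


colours = one_hot(["red", "blue", "red"])
impacts = impact_code(["a", "a", "b", "c"], [1.0, 3.0, 4.0, 0.0])
```

One-hot encoding adds one column per category, which becomes impractical for high-cardinality features. Impact coding keeps a single numeric column, but because it uses the target, it should be fitted on training data only to avoid leaking information into the evaluation.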
The preprocessor not only outputs the dataset, but also a serialised version of the transformations. The chosen format for the serialisation is the PMML (Predictive Model Markup Language). This is a description standard to share and exchange data mining and Machine Learning algorithms. Using this format, we will then be able to apply the exact same transformation at serving time, when confronted with new data.
Model selection
Once a dataset is ready, the next step is to try and fit the best model. Depending on the problem, a set of algorithms, along with their configuration space, is provided to the user. Depending on their skill level, the user can tweak the configuration space and preselect a subset of algorithms that better fit the problem.
Bayesian optimisation
The component that handles the optimisation and the model selection is the optimisation engine. When an optimisation starts, a sub-component called the controller creates an internal optimisation task and schedules the various optimisation steps performed during that task. The optimisation is achieved using Bayesian methods. Basically, a Bayesian approach consists of learning a model that predicts which configuration is likely to be the best. We can break down the steps as follows:
1. The model is in a cold state. The optimiser returns the default set of initial configurations to the controller
2. The controller distributes the initial configurations over a cluster of learners
3. Upon completion of the initial configurations, the controller stores the results
4. The optimiser starts its second iteration, and trains a model on the available data
5. Based on the resulting model, the optimiser outputs the best challengers to try. Both their potential efficiency and the amount of information they will provide to improve the selection model are considered
6. The controller distributes the new set of configurations over the cluster and waits for new information, i.e. newly evaluated configurations. Configurations are evaluated using K-fold cross-validation, to avoid overfitting
7. When new information is available, a new optimisation iteration is started, and the process begins again at step 4
8. After a predefined number of iterations, the optimisation stops
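The loop described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Prescience's optimisation engine: the objective `evaluate` is a made-up function standing in for a cross-validated training run, the configuration space (`learning_rate`, `depth`) is hypothetical, and the surrogate is a deliberately crude inverse-distance-weighted average. A real Bayesian optimiser would typically use a Gaussian process or a tree-based model with an acquisition function such as expected improvement.

```python
import math
import random


def evaluate(config):
    """Stand-in for a learner returning a validation loss (lower is better).

    Hypothetical objective with its minimum at learning_rate=0.1, depth=6.
    """
    return (math.log10(config["learning_rate"]) + 1) ** 2 + (
        (config["depth"] - 6) / 4
    ) ** 2


def sample_config(rng):
    """Draw one configuration from the (hypothetical) configuration space."""
    return {"learning_rate": 10 ** rng.uniform(-3, 0), "depth": rng.randint(1, 12)}


def to_point(config):
    """Map a configuration to normalised coordinates in [0, 1] x [0, 1]."""
    return ((math.log10(config["learning_rate"]) + 3) / 3, (config["depth"] - 1) / 11)


def surrogate(point, history):
    """Predict the loss at `point` and report the distance to the nearest result."""
    weights, total, nearest = 0.0, 0.0, float("inf")
    for seen, loss in history:
        distance = math.dist(point, seen)
        nearest = min(nearest, distance)
        weight = 1.0 / (distance + 1e-6)
        weights += weight
        total += weight * loss
    return total / weights, nearest


def acquisition(config, history, exploration):
    """Lower is better: low predicted loss, with a bonus for unexplored regions."""
    predicted, novelty = surrogate(to_point(config), history)
    return predicted - exploration * novelty


def optimise(iterations=5, batch_size=4, pool_size=200, exploration=1.0, seed=0):
    rng = random.Random(seed)
    history = []  # stored results: (point, loss) pairs
    best_config, best_loss = None, float("inf")
    for _ in range(iterations):
        if not history:
            # Cold state: start from an initial set of configurations.
            batch = [sample_config(rng) for _ in range(batch_size)]
        else:
            # Warm state: rank candidate challengers with the surrogate.
            pool = [sample_config(rng) for _ in range(pool_size)]
            pool.sort(key=lambda c: acquisition(c, history, exploration))
            batch = pool[:batch_size]
        # The controller would distribute this batch over the learners.
        for config in batch:
            loss = evaluate(config)
            history.append((to_point(config), loss))
            if loss < best_loss:
                best_config, best_loss = config, loss
    return best_config, best_loss, history


best_config, best_loss, history = optimise()
```

The `exploration` term reflects the trade-off mentioned in the steps: a challenger is interesting either because the surrogate predicts a good score, or because it lies far from anything already evaluated and would therefore teach the surrogate something new. Passing an existing `history` into a later run would be the analogue of not restarting from the cold state.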
Model validation
Once optimisation is completed, the user can either launch a new optimisation that leverages the existing data (rather than starting again from the cold state), or select a configuration according to its evaluation scores. Once a suitable configuration is reached, it is used to train the final model, which is then serialised in either the PMML format or the TensorFlow saved model format. The same learners that handled the evaluations perform the actual training.
Finally, the model is evaluated against a test set, extracted during preprocessing. This set is never used during model selection or training, to ensure that computed scoring metrics are unbiased. Based on the resulting scoring metrics, the decision can be made to use the model in production or not.
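The separation between the held-out test set and the data used for model selection can be illustrated with row indices alone. The sketch below is an assumption about how such a split might look, not Prescience's implementation: the function names, the 20% test ratio, and the choice of five folds are all illustrative.

```python
import random


def hold_out_split(n_rows, test_ratio=0.2, seed=0):
    """Shuffle row indices and reserve a fraction of them as the test set."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(n_rows * test_ratio)
    return indices[n_test:], indices[:n_test]


def k_fold(indices, k=5):
    """Yield (train, validation) index lists; each index is validated exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# The test rows are set aside once; cross-validation only ever sees train_rows.
train_rows, test_rows = hold_out_split(100)
splits = list(k_fold(train_rows, k=5))
```

Each candidate configuration would be scored as the average over the five validation folds, while `test_rows` is only touched once, to score the final model.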
Model serving
At this stage, the model is trained, exported, and ready to be served. The last component of the Prescience platform is Prescience Serving. This is a web service that consumes PMML and TensorFlow saved models, and exposes a REST API on top. As transformations are exported alongside the model, the user can query the newly deployed model using the raw data. Predictions are now ready to be used within any application.
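As an illustration of what querying a served model with raw data could look like, the sketch below builds an HTTP request with the Python standard library. Everything specific in it is hypothetical: the URL, the `arguments` payload key, and the bearer-token header are placeholders, not the documented Prescience Serving API.

```python
import json
import urllib.request

# Hypothetical endpoint: a placeholder, not a real Prescience Serving URL.
SERVING_URL = "https://serving.example.invalid/models/my-model/predict"


def build_prediction_request(raw_row, token):
    """Build a POST request carrying one raw, untransformed row as JSON."""
    body = json.dumps({"arguments": raw_row}).encode("utf-8")  # assumed payload shape
    return urllib.request.Request(
        SERVING_URL,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed authentication scheme
        },
    )


request = build_prediction_request({"age": 34, "city": "Lille"}, token="<token>")

# Sending is left commented out because the URL above is a placeholder.
# with urllib.request.urlopen(request) as response:
#     prediction = json.load(response)
```

The point of the example is that the client sends values such as `"Lille"` as-is: encoding them is the job of the transformations exported alongside the model.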
Model monitoring
In addition, one of the characteristic features of Machine Learning is its ability to adapt itself to new data. Unlike traditional, hardcoded business rules, a Machine Learning model is able to adapt to the underlying patterns. To do this, the Prescience platform enables users to easily update sources, refresh datasets, and retrain models. These lifecycle steps help keep the model relevant to the problem that needs solving. Users can then match their retraining frequency to the rate at which newly qualified data is generated. They can even interrupt the training process in the event of an anomaly in the data generation pipeline. Each time a model is retrained, a new set of scores is computed, and stored in OVH Observability for monitoring.
As we outlined at the beginning of this blog post, having an accurate model does not give any guarantees about its ability to maintain this accuracy over time. For numerous reasons, model performance can weaken. For example, the raw data can decrease in quality, some anomalies can appear in the data engineering pipelines, or the problem itself can drift, rendering the current model irrelevant, even after retraining. It is therefore essential to continuously monitor model performance throughout the entire lifecycle, to avoid making decisions based on an obsolete model.
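A very simple form of such monitoring is to compare the score obtained after each retraining with the scores that preceded it. The sketch below assumes a metric where higher is better; the function name, window size, tolerance, and score values are illustrative choices, not part of Prescience.

```python
import statistics


def has_degraded(scores, window=5, tolerance=0.05):
    """Flag the latest score if it falls more than `tolerance` below the
    mean of the `window` scores that precede it (higher score = better)."""
    if len(scores) < window + 1:
        return False  # not enough history to establish a baseline
    baseline = statistics.mean(scores[-window - 1 : -1])
    return scores[-1] < baseline - tolerance


# Illustrative score series, one value per retraining.
stable = [0.91, 0.92, 0.90, 0.91, 0.92, 0.90]
drifting = [0.91, 0.92, 0.90, 0.91, 0.92, 0.84]

stable_alert = has_degraded(stable)
drifting_alert = has_degraded(drifting)
```

An alert of this kind does not say why performance dropped. It only signals that the data quality, the pipeline, or the problem itself should be investigated before the model's predictions are trusted further.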
The move towards an AI-driven company
Prescience is currently used at OVH to solve several industrial problems, such as fraud prevention and predictive maintenance in datacentres.
With this platform, we plan on empowering more and more teams and services at OVH with the ability to optimise their processes through Machine Learning. We are particularly excited about our work with Time Series, which has a decisive role in the operation and monitoring of hundreds of thousands of servers and virtual machines.
The development of Prescience is conducted by the Machine Learning Services team. MLS is composed of four Machine Learning engineers: Mael, Adrien, Raphael, and myself. The team is supervised by Guillaume, who helped me design the platform. In addition, the team includes two data scientists, Olivier and Clement, who handled internal use cases and provided us with feedback, and finally, Laurent: a CIFRE student working on multi-objective optimisation in Prescience, in collaboration with the ORKAD Research Team.