Improving the quality of data with Apache Spark

Today we bring you a guest post by Hubert Stefani, Chief Innovation Officer and co-founder of Novagen Conseil.

As expert data consultants and heavy Apache Spark users, we felt honoured to become early adopters of OVHcloud Data Processing. As a first use case to test this offering, we chose our quality assessment process.

As a data consultancy based in Paris, we build complete and innovative data strategies for our large corporate and public customers: top banks, public authorities, retailers, the fashion industry, transportation leaders, etc. We offer them large-scale BI, data lake creation and management, and business innovation through data science. Within our Data Lab, we select best-in-class technologies and create what we call ‘boosters’, i.e. ready-to-deploy or customized data assets.

When it comes to selecting a new technology solution, we use the following checklist:

Innovation and extensibility: depth of functionality, added value and usability
Performance and cost-effectiveness: intrinsic performance, but also technical architectures that adapt to customer needs
Open standards and governance: to support our customers’ cloud or multi-cloud strategies, we choose to rely on open standards to deploy on different targets and preserve reversibility.

Apache Spark, our Swiss Army knife

About a month ago, OVHcloud’s Data and AI Product Manager, Bastien Verdebout, approached us to test their new product, OVHcloud Data Processing, built on top of Apache Spark. The answer was of course yes!

One of the reasons we were so eager to discover this data-processing-as-a-service solution is that we use Apache Spark extensively; it is our Swiss Army knife for processing data.

It works on extremely large volumes of data,
It meets the needs of both data engineering and data science,
It allows the processing of data at rest and of streaming data,
It’s the de facto standard for data workloads on-premises and in the Cloud,
It offers built-in APIs for Python, Scala, Java and R.

We have progressively developed software assets on top of Apache Spark to address recurring challenges such as:

ETL processing in data lake environments,
Quality KPIs on top of data lake sources,
Machine Learning algorithms for Natural Language Processing, Time Series predictions…

The ideal use case: data quality assessment

We considered the following characteristics of OVHcloud Data Processing:

Processing engine built on top of Apache Spark 2.4.3
Jobs start after a few seconds (vs minutes to launch a cluster)
Ability to adjust the power dedicated to each Spark job: from low power (1 driver and 1 executor with 4 cores and 8 GB of memory) to large-scale processing (potentially hundreds of cores and GB of memory)
Full compute/storage separation, in line with standard cloud architectures, including S3 APIs to access data stored in the Object Storage layer
Jobs execution and monitoring through Command Line Interface and API

These characteristics led us to choose our quality assessment process as an ideal use case, since it requires both interactivity and adjustable compute resources to deliver quality KPIs through Spark processes.
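To make the notion of a quality KPI concrete, here is a minimal sketch of two common indicators, completeness and uniqueness. It is only an illustration: the KPI definitions, the function names and the sample records are assumptions rather than Novagen's actual metrics, and it is written in plain Python for readability, whereas a Spark job would compute the same counts as DataFrame aggregations.

```python
from typing import Dict, List, Optional

# A record is a simple mapping from column name to value (hypothetical shape).
Record = Dict[str, Optional[str]]


def _is_filled(value: Optional[str]) -> bool:
    """A value counts as filled when it is neither None nor an empty string."""
    return value not in (None, "")


def completeness(records: List[Record], column: str) -> float:
    """Share of records whose value for `column` is filled (0.0 if no records)."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if _is_filled(r.get(column)))
    return filled / len(records)


def uniqueness(records: List[Record], column: str) -> float:
    """Share of filled values for `column` that are distinct (0.0 if none)."""
    values = [r[column] for r in records if _is_filled(r.get(column))]
    if not values:
        return 0.0
    return len(set(values)) / len(values)


# Hypothetical sample data: one missing email and one duplicated email.
customers: List[Record] = [
    {"id": "1", "email": "a@example.com"},
    {"id": "2", "email": None},
    {"id": "3", "email": "a@example.com"},
    {"id": "4", "email": "b@example.com"},
]

print("email completeness:", completeness(customers, "email"))
print("email uniqueness:", uniqueness(customers, "email"))
```

Indicators of this kind are cheap on a sample but scan whole columns on real sources, which is where adjustable compute resources become useful.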

OVHcloud Data Processing at work

The command generated by our software to submit the job is:

./ovh-spark-submit --projectid ec7d2cb6da084055a0501b2d8d8d62a1 \
  --class tech.novagen.spark.Launcher --driver-cores 4 --driver-memory 8G \
  --executor-cores 4 --executor-memory 8G --num-executors 5 \
  swift://sparkjars/QualitySparkExecutor-1.0-spark.jar --apiServer=5.1.1.2:80

This command is quite similar to a usual spark-submit, except for the jar path: the binary must be stored in an Object Storage bucket, which we reference with a swift:// URL. (NB: the same job could have been submitted with a call to the OVHcloud Data Processing API.)
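Since the command is produced by software, the assembly step can be sketched as follows. This is a minimal illustration in Python, not Novagen's actual generator: the `SparkJobSpec` class, its field names and the `build_submit_command` and `total_cores` helpers are hypothetical, and only the flags that appear in the command above are used.

```python
import shlex
from dataclasses import dataclass, field
from typing import List


@dataclass
class SparkJobSpec:
    """Hypothetical description of one job; field names are illustrative."""
    project_id: str
    main_class: str
    jar_url: str
    driver_cores: int
    driver_memory: str
    executor_cores: int
    executor_memory: str
    num_executors: int
    app_args: List[str] = field(default_factory=list)


def build_submit_command(spec: SparkJobSpec) -> List[str]:
    """Assemble the argument list for the ovh-spark-submit wrapper."""
    # The jar has to live in Object Storage, so insist on a swift:// URL.
    if not spec.jar_url.startswith("swift://"):
        raise ValueError("the jar must be referenced by a swift:// URL")
    return [
        "./ovh-spark-submit",
        "--projectid", spec.project_id,
        "--class", spec.main_class,
        "--driver-cores", str(spec.driver_cores),
        "--driver-memory", spec.driver_memory,
        "--executor-cores", str(spec.executor_cores),
        "--executor-memory", spec.executor_memory,
        "--num-executors", str(spec.num_executors),
        spec.jar_url,
        *spec.app_args,  # arguments passed through to the application itself
    ]


def total_cores(spec: SparkJobSpec) -> int:
    """Cores requested by the job, useful for checking against project quotas."""
    return spec.driver_cores + spec.executor_cores * spec.num_executors


# Values taken from the command shown above.
spec = SparkJobSpec(
    project_id="ec7d2cb6da084055a0501b2d8d8d62a1",
    main_class="tech.novagen.spark.Launcher",
    jar_url="swift://sparkjars/QualitySparkExecutor-1.0-spark.jar",
    driver_cores=4,
    driver_memory="8G",
    executor_cores=4,
    executor_memory="8G",
    num_executors=5,
    app_args=["--apiServer=5.1.1.2:80"],
)

# shlex.join quotes each argument so the result can be pasted into a shell.
print(shlex.join(build_submit_command(spec)))
print("total cores requested:", total_cores(spec))
```

Keeping the job description separate from the command line makes it easy to rerun the same job with a different number of executors, and the argument list can be handed directly to `subprocess.run` without going through a shell.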

From this point, we can fine-tune our portfolio of processes and experiment with different power allocations with few limits (other than the quotas assigned to any public cloud project).

Real-time display of job logs

Finally, for tuning and post-mortem job analysis, we can rely on the saved log files. OVHcloud Data Processing also offers a real-time display of job logs, which is very convenient, and provides complementary supervision through Grafana dashboards.

This is a first yet significant test of OVHcloud Data Processing. So far, it has proved an excellent match for the Novagen quality process use case and has allowed us to validate several crucial points when it comes to testing a new data solution.

The product is still in its early days, and we will keep a close eye on upcoming functionality. The OVHcloud team has unveiled part of its roadmap, and it looks really promising!
Hubert Stefani, Chief Innovation Officer of Novagen Conseil

Hubert Stefani

Hubert has spent most of his career in tech consultancy companies as a Java and web expert, committed to bringing innovation to many demanding customers.

Later the CTO of a startup, Hubert has developed substantial expertise in Big Data and Cloud IT ecosystems over the past 10 years. He has been Chief Innovation Officer and co-founder of Novagen Conseil since 2016.