<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Data Processing Archives - OVHcloud Blog</title>
	<atom:link href="https://blog.ovhcloud.com/tag/data-processing/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.ovhcloud.com/tag/data-processing/</link>
	<description>Innovation for Freedom</description>
	<lastBuildDate>Mon, 03 Jul 2023 08:46:27 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://blog.ovhcloud.com/wp-content/uploads/2019/07/cropped-cropped-nouveau-logo-ovh-rebranding-32x32.gif</url>
	<title>Data Processing Archives - OVHcloud Blog</title>
	<link>https://blog.ovhcloud.com/tag/data-processing/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>How to tackle data problems</title>
		<link>https://blog.ovhcloud.com/how-to-tackle-data-problems/</link>
		
		<dc:creator><![CDATA[Philip Marais]]></dc:creator>
		<pubDate>Thu, 17 Feb 2022 16:29:14 +0000</pubDate>
				<category><![CDATA[OVHcloud Startup Program]]></category>
		<category><![CDATA[Data Platform]]></category>
		<category><![CDATA[Data Processing]]></category>
		<category><![CDATA[Startup Program]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=22478</guid>

					<description><![CDATA[Let&#8217;s say you have a data problem&#8230; Maybe you want to extract some value from your data, for example to give your app an advantage in the market. You look at all the data you have stored, and at the moment it looks like the inside of your house during lockdown&#8230; You did a search [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fhow-to-tackle-data-problems%2F&amp;action_name=How%20to%20tackle%20data%20problems&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<h3 class="wp-block-heading" id="let-s-say-you-have-a-data-problem">Let&#8217;s say you have a data problem&#8230; </h3>



<p>Maybe you want to extract some value from your data, for example to give your app an advantage in the market. You look at all the data you have stored, and at the moment it looks like the inside of your house during lockdown&#8230; You did a search on what tools to use and you ended up on this <a href="http://mattturck.com/wp-content/uploads/2019/07/2019_Matt_Turck_Big_Data_Landscape_Final_Fullsize.png" target="_blank" rel="noreferrer noopener nofollow external" data-wpel-link="external">page</a>, so you got more questions than answers. As CFO you can speak some of the data lingo but you are not going to be mastering advanced statistics or machine learning models any time soon.</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img fetchpriority="high" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/02/IMG_0816-1024x537.jpeg" alt="How to tackle data problems" class="wp-image-22488" width="512" height="269" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/02/IMG_0816-1024x537.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/02/IMG_0816-300x157.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/02/IMG_0816-768x403.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/02/IMG_0816.jpeg 1200w" sizes="(max-width: 512px) 100vw, 512px" /></figure></div>



<h3 class="wp-block-heading" id="so-what-do-you-do">So, what do you do?</h3>



<p> Well, the first step is to <strong>make sure you understand what the problem</strong> is you are trying to solve. Is it an AI problem or could it be solved with a simpler analytics tool? Presumably you have a business problem, and you hope that you will find the answer in your chaotic data. You need to phrase this business problem as a data problem in order to<strong> identify what tool to use</strong> to extract the answer. </p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img decoding="async" width="1024" height="634" src="https://blog.ovhcloud.com/wp-content/uploads/2022/02/IMG_0815-1024x634.jpeg" alt=" Well, the first step is to make sure you understand what the problem is you are trying to solve. Is it an AI problem or could it be solved with a simpler analytics tool? Presumably you have a business problem, and you hope that you will find the answer in your chaotic data. You need to phrase this business problem as a data problem in order to identify what tool to use to extract the answer. " class="wp-image-22486" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/02/IMG_0815-1024x634.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/02/IMG_0815-300x186.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/02/IMG_0815-768x476.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/02/IMG_0815.jpeg 1536w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure></div>



<p>Do you need a machine to<strong> solve your problem like a human</strong> would? Then you have a problem that <strong>artificial intelligence (AI</strong>) could assist with. Or do you need an algorithm that<strong> learns from examples</strong> without being explicitly programmed? Then you have a problem that <strong>machine learning (ML)</strong> could assist with. Or do you need to <strong>deliver some pretty graphs</strong> once you have cleaned up your data? Then you have a problem that <strong>business intelligence (BI) </strong>tools could assist with. Or can the solution be delivered by<strong> programming a few rules</strong>? Then you have a problem that could be solved with <strong>conventional programming</strong>. These are some of the questions you need to answer.</p>



<p>If you just need to try and<strong> determine what happened </strong>from your data (i.e., be descriptive) you may be able to do this with a <strong>BI tool or conventional programming</strong>. You may need a <strong>machine learning</strong> solution if you need to determine what will happen (i.e.,<strong> obtain a prediction</strong>) or what to do (i.e., <strong>provide prescriptions</strong>) from your data. Typical predictive questions may be “what music will my customer want to download through my app next month?” or “how much discount should I provide to my customer next week?”. Typical prescriptive questions would be “how should we segment our customers?” or “what topics are in our customer feedback?”.</p>



<h3 class="wp-block-heading" id="aligning-objectives-and-expectations">Aligning objectives and expectations</h3>



<p>Of course, the defined problem needs to<strong> align with your business objectives and expectations</strong>. What are your strategic objectives, what are the timeframes for delivery and are the risks understood? Do you have the necessary skills and resources for implementation? Lack of alignment could hamper reaching the objective further down the process.</p>



<p>The above questions will all help you to<strong> translate your business objectives into data related objectives</strong> and goals and <strong>define your problem statement</strong>. Your problem statement will define the problem, the impact it has and what has been identified as the best starting point to solve the problem. The starting point will include one or more of your data related goals. </p>



<p>Remember that your <strong>objective needs to be specific</strong>. An objective like, “We want to use AI to make our product better that the competition”, does not really communicate what problem you are trying to solve and is unlikely to align with business objectives. An objective like, “We want to predict our customers’ next product choices in order to give them a better customer experience” helps to clarify the business problem and the data problem.</p>



<h3 class="wp-block-heading" id="choosing-the-right-technology">Choosing the right technology</h3>



<p>Once the above questions have been answered, the objectives are clear and when the problem is well defined it is then <strong>much easier to choose the correct technology</strong> to address the problem. From there you can dive deeper into the specific data requirements and resource needs to use that technology. If you are interested to find out more about mastering your data, we are running an <strong>AI and data focused event on 10 March 2022</strong> that includes a roundtable on this topic and more. The key takeaways will be:</p>



<p>Find out more about the event and<strong> sign up <a href="https://hopin.com/events/super-connect-series-showcase-ai/registration" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">here</a></strong> or you can find out more about <strong>our managed cloud solutions <a href="https://www.ovhcloud.com/en-gb/public-cloud/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">here</a>.</strong></p>



<div class="wp-block-image"><figure class="aligncenter size-full is-resized"><a href="https://startup.ovhcloud.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer"><img decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2021/06/IMG_0561-05.png" alt="OVHcloud Startup Program" class="wp-image-20943" width="372" height="248" srcset="https://blog.ovhcloud.com/wp-content/uploads/2021/06/IMG_0561-05.png 496w, https://blog.ovhcloud.com/wp-content/uploads/2021/06/IMG_0561-05-300x200.png 300w" sizes="(max-width: 372px) 100vw, 372px" /></a></figure></div>



<p>To find out more about the OVHcloud Startup Program, or to sign up for free credits and technical support, go to <a href="https://startup.ovhcloud.com/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://startup.ovhcloud.com/</a></p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fhow-to-tackle-data-problems%2F&amp;action_name=How%20to%20tackle%20data%20problems&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>The new Logs Data Platform</title>
		<link>https://blog.ovhcloud.com/the-new-logs-data-platform/</link>
		
		<dc:creator><![CDATA[Carington Lucas Saint George]]></dc:creator>
		<pubDate>Mon, 26 Apr 2021 11:12:00 +0000</pubDate>
				<category><![CDATA[OVHcloud Product News]]></category>
		<category><![CDATA[Data Platform]]></category>
		<category><![CDATA[Data Processing]]></category>
		<category><![CDATA[Logs Data Platform]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=20753</guid>

					<description><![CDATA[At OVHcloud, we see many different use cases around log management. In addition, we have recently had the opportunity to discuss with several client companies with different approaches and maturity on these topics. Based on these insights, we have made major changes to our LDP solution to address these new ways of consuming logs. In [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fthe-new-logs-data-platform%2F&amp;action_name=The%20new%20Logs%20Data%20Platform&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p>At OVHcloud, we see many different use cases around log management. In addition, we have recently had the opportunity to discuss with several client companies with different approaches and maturity on these topics. Based on these insights, we have made major changes to our LDP solution to address these new ways of consuming logs.</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2021/04/IMG_0534-1024x538.png" alt="The new Logs Data Platform" class="wp-image-20793" width="768" height="404" srcset="https://blog.ovhcloud.com/wp-content/uploads/2021/04/IMG_0534-1024x538.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2021/04/IMG_0534-300x158.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2021/04/IMG_0534-768x403.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2021/04/IMG_0534.png 1198w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure></div>



<p>In this blog post we will see the historical features that made the success of this platform and the improvements resulting from your feedback.</p>



<h3 class="wp-block-heading">Let&#8217;s start from the beginning</h3>



<h4 class="wp-block-heading">Why are logs important?</h4>



<p>It’s critical you know at anytime, what’s going on into your Information System and applications.. You must be able to understand, analyze and monitor the health of your IT and applications to fix any issue. Log files are created each time an event occurs on your computer systems. They therefore provide useful information about the state of your services and infrastructure. These files come from different sources (physical servers, virtual servers, mobile applications, websites, etc.) and are counted in the millions every day, making them difficult to analyze. From your logs, you can extract valuable information about the behavior of your customers or your systems, and then act accordingly.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="768" height="455" src="https://www.ovh.com/blog/wp-content/uploads/2021/03/image2021-1-26_2-48-1.png" alt="" class="wp-image-20754" srcset="https://blog.ovhcloud.com/wp-content/uploads/2021/03/image2021-1-26_2-48-1.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2021/03/image2021-1-26_2-48-1-300x178.png 300w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure></div>



<h4 class="wp-block-heading">How do I analyze my logs?</h4>



<p>With so many logs to collect, store and analyse, we know most IT administrators need turnkey integrated solution. That what we had in mind when we designed our platform. Indeed, Logs Data Platform is a turnkey solution, which allows you to collect, store and analyze logs. It supports the different log files, whether they are related to applications, servers or security. You have the possibility to use the log collector of your choice (Syslog-ng, Fluentd, NXLog) or to use our dedicated collectors (Flowgger and Logstash). These collectors work regardless of the source, format or structure of your data. For analysis, you can rely on visualization via Graylog, Kibana or Grafana. For example, in the context of infrastructure supervision, you can monitor logs at the application or server level. By making you benefit from the ELK (Elasticsearch Logstash Kibana) ecosystem, Logs Data Platform is a powerful log analysis solution.</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2021/04/IMG_0533-1024x418.png" alt="Logs Data Platform" class="wp-image-20791" width="1024" height="418" srcset="https://blog.ovhcloud.com/wp-content/uploads/2021/04/IMG_0533-1024x418.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2021/04/IMG_0533-300x123.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2021/04/IMG_0533-768x314.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2021/04/IMG_0533-1536x628.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2021/04/IMG_0533.png 1799w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<div style="height:100px" aria-hidden="true" class="wp-block-spacer"></div>



<h2 class="wp-block-heading"><strong>3 ways to use Logs Data Platform</strong></h2>



<p>We saw that we could diffrenciate 3 major way to store and analyze logs, depending on the business usecase. So we decided that for each log stream you collect, you can activate any of the 3 following approach (or all 3 at the same time) :</p>



<div class="wp-block-image is-style-default"><figure class="aligncenter size-full is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2021/03/21.png" alt="" class="wp-image-20757" width="658" height="413" srcset="https://blog.ovhcloud.com/wp-content/uploads/2021/03/21.png 877w, https://blog.ovhcloud.com/wp-content/uploads/2021/03/21-300x188.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2021/03/21-768x483.png 768w" sizes="auto, (max-width: 658px) 100vw, 658px" /></figure></div>



<div style="height:100px" aria-hidden="true" class="wp-block-spacer"></div>



<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow">
<ul class="wp-block-list"><li><strong>WebSocket broadcasting</strong> allows you to see what&#8217;s going on in your application or server in real time. Indeed with this feature our Logs Data Platform allows you to connect different applications or servers to one unique endpoint and make all of them appear in one stream if needed.&nbsp;<strong>ldp-tail</strong>&nbsp;is able to follow one your stream in real-time with sub-second latency.</li></ul>



<p><a href="https://docs.ovh.com/fr/logs-data-platform/ldp-tail/" data-wpel-link="exclude">Follow the guide!</a></p>
</div></div>



<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow">
<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2021/03/final.gif" alt="" class="wp-image-20758" width="762" height="410"/></figure></div>
</div></div>



<ul class="wp-block-list"><li>Logs Data Platform allows you to <strong>index</strong> your logs with a flexible retention period ranging from 14 days to 1 year, which allows you to analyze the data over a given period of time.</li></ul>



<p><a href="https://docs.ovh.com/fr/logs-data-platform/quick-start/" data-wpel-link="exclude">Follow the guide!</a></p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2021/03/3.png" alt="" class="wp-image-20759" width="650" height="284" srcset="https://blog.ovhcloud.com/wp-content/uploads/2021/03/3.png 866w, https://blog.ovhcloud.com/wp-content/uploads/2021/03/3-300x131.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2021/03/3-768x336.png 768w" sizes="auto, (max-width: 650px) 100vw, 650px" /></figure></div>



<ul class="wp-block-list"><li>You also have the possibility to archive your logs for a long period of time (from 1 year to 10 years) thanks to the <strong>Cold Storage</strong> feature of Logs Data Platform. This can be very useful for example within the framework of the GDPR requirements or simply to keep the log history of your infrastructure.</li></ul>



<p><a href="https://docs.ovh.com/fr/logs-data-platform/cold-storage/" data-wpel-link="exclude">Follow the guide!</a></p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2021/03/4.png" alt="" class="wp-image-20760" width="396" height="362" srcset="https://blog.ovhcloud.com/wp-content/uploads/2021/03/4.png 528w, https://blog.ovhcloud.com/wp-content/uploads/2021/03/4-300x274.png 300w" sizes="auto, (max-width: 396px) 100vw, 396px" /></figure></div>



<h3 class="wp-block-heading">How does it work?</h3>



<p>Now, I think you see broadly how you can leverage the platform. Let&#8217;s dig deeper on the technologies that power it and how you can leverage them for ingest, query and analysis.</p>



<h4 class="wp-block-heading" id="Blogpost(draft)-Ingestion">Ingestion</h4>



<p>Logs Data Platform is compatible with most of the standard protocols on the market: GELF, SYSLOG, Cap&#8217;n&#8217;Proto, LTSV, RFC5425, Beats&#8230;</p>



<p>Moreover you also have the possibility to subscribe to dedicated collectors such as Logstash or Flowgger for more flexibility.</p>



<h4 class="wp-block-heading" id="Blogpost(draft)-Query">Query</h4>



<p>If you have chosen to index your logs, then you have different ways to analyze the results: you can choose to use one of the visualization tools provided by OVHcloud (Graylog, Grafana or Kibana) or use the Elasticsearch, Graylog or OVHcloud APIs in order to use your own analysis tools.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="690" src="https://www.ovh.com/blog/wp-content/uploads/2021/04/IMG_0542-1024x690.png" alt="Logs Data Platform" class="wp-image-20807" srcset="https://blog.ovhcloud.com/wp-content/uploads/2021/04/IMG_0542-1024x690.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2021/04/IMG_0542-300x202.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2021/04/IMG_0542-768x517.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2021/04/IMG_0542-1536x1034.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2021/04/IMG_0542-2048x1379.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<div style="height:100px" aria-hidden="true" class="wp-block-spacer"></div>



<h3 class="wp-block-heading">And the little bonus</h3>



<p>Logs Data Platform also allows you to index data other than logs thanks to its <strong>Index as a Service</strong> feature based on Elasticsearch, you can for example index documents.<br>Thanks to this feature you can for example create powerful search engines thanks to the performance of Elasticsearch and all this without worrying about the integration of Elasticsearch because the Index as a Service of Logs Data Platform is a turnkey solution fully managed by OVHcloud.</p>



<p><a href="https://docs.ovh.com/fr/logs-data-platform/index-as-a-service/" data-wpel-link="exclude">Follow the guide!</a></p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2021/04/IMG_0537.png" alt="" class="wp-image-20801" width="313" height="438" srcset="https://blog.ovhcloud.com/wp-content/uploads/2021/04/IMG_0537.png 625w, https://blog.ovhcloud.com/wp-content/uploads/2021/04/IMG_0537-214x300.png 214w" sizes="auto, (max-width: 313px) 100vw, 313px" /></figure></div>



<h3 class="wp-block-heading">So what&#8217;s new in this new version?</h3>



<div class="wp-block-image"><figure class="alignright size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2021/04/IMG_0536.png" alt="" class="wp-image-20799" width="341" height="256" srcset="https://blog.ovhcloud.com/wp-content/uploads/2021/04/IMG_0536.png 682w, https://blog.ovhcloud.com/wp-content/uploads/2021/04/IMG_0536-300x225.png 300w" sizes="auto, (max-width: 341px) 100vw, 341px" /></figure></div>



<ul class="wp-block-list"><li>You expressed us the wish to have a more flexible invoicing, so we changed our pricing model to pay-as-you-go. Indeed, pay-as-you-go makes invoicing simpler, more readable and predictable. Moreover, you can now take advantage of thresholds and alerts to improve your consumption efficiency.</li></ul>



<ul class="wp-block-list"><li>Until recently our logs analysis platform was only available for French customers, now Logs Data Platform is available in all countries and in all languages!</li></ul>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="352" height="364" src="https://www.ovh.com/blog/wp-content/uploads/2021/03/image2021-1-26_5-22-45.png" alt="" class="wp-image-20764" srcset="https://blog.ovhcloud.com/wp-content/uploads/2021/03/image2021-1-26_5-22-45.png 352w, https://blog.ovhcloud.com/wp-content/uploads/2021/03/image2021-1-26_5-22-45-290x300.png 290w" sizes="auto, (max-width: 352px) 100vw, 352px" /></figure></div>



<div class="wp-block-image"><figure class="alignright size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2021/04/IMG_0535.png" alt="" class="wp-image-20798" width="273" height="233" srcset="https://blog.ovhcloud.com/wp-content/uploads/2021/04/IMG_0535.png 545w, https://blog.ovhcloud.com/wp-content/uploads/2021/04/IMG_0535-300x257.png 300w" sizes="auto, (max-width: 273px) 100vw, 273px" /></figure></div>



<ul class="wp-block-list"><li>Security and confidentiality are becoming more and more important in company policies, so in order to comply with your stringent security requirements, we have created the Enterprise Logs Account on Logs Data Platform. Thanks to this dedicated cluster you are totally isolated. It will allow us to offer you brand new features such as the Network Access Control List or customizable retention period.</li></ul>



<h3 class="wp-block-heading">Moreover</h3>



<p>Moreover, Logs Data Platform will soon be ISO/IEC certified.<br>Indeed, we are only a few weeks away from obtaining the ISO/IEC 27001, 27017, 27018 and 27701 certificates.<br>What do these norms correspond to?</p>



<p>We won&#8217;t go into boring legal details here, but to put it simply:</p>



<ul class="wp-block-list"><li>ISO/IEC 27001:2013 Certification and ISMS relating to Information security management systems for cloud services</li><li>ISO/IEC 27017:2015 Certification relating to information security controls for cloud services</li><li>ISO/IEC 27018:2014 Code of practice for protection of personally identifiable information for cloud services</li><li>ISO/IEC 27701:2019 Certification and PIMS relating to personal data processing security management</li></ul>



<p>To summarize, these ISO/IEC certifications ensure the presence of an Information Security Management System (ISMS) for the management of risks, vulnerabilities and business continuity, as well as a Privacy Information Management System (PIMS).</p>



<p>That said, if the legal is your passion, you&#8217;ll find more details on <a href="http://iso.org" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">iso.org</a>.</p>



<p>And that&#8217;s not all, at the same time our Enterprise Logs Account offer will be HDS compatible to host health data. You will find more information <a href="https://www.ovhcloud.com/en-ie/enterprise/certification-conformity/hds/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">here</a>.</p>



<hr class="wp-block-separator"/>



<h3 class="wp-block-heading">NB: </h3>



<p>A few weeks after we released these improvements on our Logs Data Platform Product, Elastic announced changes in licensing for the future versions of Elasticsearch and Kibana offered as a service.</p>



<p>Other members of the Elastic open-source community announced that open versions of those components will continue to exist. Be reassured that the platform as it exists now is not impacted by the change and we will in the mid term future explore the best options to keep offering your the latest feature of the ecosystem, sticking to our S.M.A.R.T. values.</p>



<hr class="wp-block-separator"/>



<h3 class="wp-block-heading">Contact us</h3>



<p><a href="https://community.ovh.com/c/platform/data-platforms" data-wpel-link="exclude">https://community.ovh.com/c/platform/data-platforms</a></p>



<h3 class="wp-block-heading" id="Blogpost(draft)-Someusefullinks">Some useful links</h3>



<ul class="wp-block-list"><li><a href="https://www.ovhcloud.com/en-ie/data-platforms/logs/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://www.ovhcloud.com/en-ie/data-platforms/logs/</a></li><li><a href="https://docs.ovh.com/fr/logs-data-platform/" data-wpel-link="exclude">https://docs.ovh.com/fr/logs-data-platform/</a></li></ul>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fthe-new-logs-data-platform%2F&amp;action_name=The%20new%20Logs%20Data%20Platform&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Why are you still managing your data processing clusters?</title>
		<link>https://blog.ovhcloud.com/why-are-you-still-managing-your-data-processing-clusters/</link>
		
		<dc:creator><![CDATA[Mojtaba Imani]]></dc:creator>
		<pubDate>Wed, 30 Sep 2020 16:14:35 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Apache Spark]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Cluster Computing]]></category>
		<category><![CDATA[Data Platform]]></category>
		<category><![CDATA[Data Processing]]></category>
		<category><![CDATA[Distributed Computing]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=19241</guid>

					<description><![CDATA[Cluster computing is used to share a computation load among a group of computers. This achieves a higher level of performance and scalability.&#160;&#160;&#160; Apache Spark is an open-source, distributed and cluster-computing framework, that is much faster than the previous one (Hadoop MapReduce). This is thanks to features like in-memory processing and lazy evaluation. Apache Spark [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fwhy-are-you-still-managing-your-data-processing-clusters%2F&amp;action_name=Why%20are%20you%20still%20managing%20your%20data%20processing%20clusters%3F&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p>Cluster computing is used to share a computation load among a group of computers. This achieves a higher level of performance and scalability.&nbsp;&nbsp;&nbsp;</p>



<p>Apache Spark is an open-source, distributed and cluster-computing framework, that is much faster than the previous one (Hadoop MapReduce). This is thanks to features like in-memory processing and lazy evaluation. Apache Spark is the most popular tool in this category.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="539" src="https://www.ovh.com/blog/wp-content/uploads/2020/09/IMG_0282-1024x539.png" alt="Why are you still managing your data processing clusters?" class="wp-image-19364" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0282-1024x539.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0282-300x158.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0282-768x404.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0282.png 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>The analytics engine is the leading framework for large-scale SQL, batch processing, stream processing and machine learning. For coding in Spark, you have the option of using different programming languages; including Java, Scala, Python, R and SQL. It can be run locally on a single machine, or on a cluster of computers for task distribution.</p>



<p>By using Apache Spark, you can process your data on your local computer, or you can create a cluster and send it any number of processing jobs.</p>



<p>It is possible to create your cluster with physical computers on-premises, with virtual machines in a hosting company, or with any cloud provider. With your own cluster, you’ll have the ability to send Spark jobs whenever you like.&nbsp;</p>



<h3 class="wp-block-heading">Cluster Management Challenges&nbsp;</h3>



<p>If you are processing a huge amount of data and you expect to have results in a reasonable time, your local computer won’t be enough. You need a cluster of computers to divide the data and process workloads &#8211; several computers are run in parallel to speed up the task.&nbsp;</p>



<p>Creating and managing your own cluster of computers, however, is not an easy task. You will face several challenges:&nbsp;</p>



<h4 class="wp-block-heading">Cluster Creation&nbsp;</h4>



<p>Creating an Apache Spark cluster is an arduous task.&nbsp;</p>



<p>First, you’ll need to create a cluster of computers and install an operating system, development tools (Python, Java, Scala), etc.&nbsp;</p>



<p>Second, you’ll need to select a version of Apache Spark and install the necessary nodes (master&nbsp;and workers).&nbsp;</p>



<p>Lastly, you’ll need to connect all these nodes together to finalize your&nbsp;Apache Spark cluster. </p>



<p>All in all, it can take several hours to create and configure a new Apache Spark cluster.</p>



<h4 class="wp-block-heading">Cluster Management&nbsp;</h4>



<p>But once you have your own cluster up and running, your job is far from over. Is your cluster working well? Is each and every node healthy?&nbsp;</p>



<p>Here is the second challenge: facing the pain of cluster management!</p>



<p>You’ll need to check the health of all your nodes manually or, preferably, install monitoring tools that report any issues nodes may encounter. </p>



<p>Do the nodes have enough disk space available for new tasks? One key issue faced by Apache Spark clusters is that some tasks write a lot of data to the local disk space of nodes without ever deleting it. Disk space is a common problem and, as you may know, running out of it makes it impossible to run more tasks.&nbsp;</p>



<p>Do you need to run multiple Spark jobs at the same time? Sometimes a single job occupies all the CPU and RAM resources in your cluster and doesn’t allow other jobs to start and run at the same time.&nbsp;</p>



<p>These are only a few of the problems you will encounter while working with your own clusters.</p>



<h4 class="wp-block-heading">Cluster Security</h4>



<p>Now for the third challenge! What is even more important than having a cluster up and running smoothly?</p>



<p>You guessed it: security. After all, Apache Spark is a Data Processing tool. And data is very sensitive.</p>



<p>Where in your cluster does security matter most? </p>



<p>What about the connection between nodes? Are they connected with a secured (and fast) connection? Who has access to the servers housing your cluster?&nbsp;</p>



<p>If you have created your cluster on the cloud and you are working with sensitive data, you’ll need to address these issues by securing each and every node and encrypting communications between them.&nbsp;</p>



<h4 class="wp-block-heading">Spark Version</h4>



<p>Here is your fourth challenge: managing your cluster&#8217;s user expectations. In some cases this may be a less daunting task, but not all. </p>



<p>There isn&#8217;t a whole lot you can do to change the expectations of your cluster&#8217;s users, but here&#8217;s a common example to help you prepare:</p>



<p>Do your users like to test their code with different versions of Apache Spark? Or do they require the latest feature from the latest Spark nightly build?</p>



<p>When you create an Apache Spark cluster, you have to select one version of Spark. Your whole cluster will be bound to it, and <em>it</em> alone. This means it isn&#8217;t possible for several versions to coexist in the same cluster.</p>



<p>So, either you’ll have to change the Spark version of your whole cluster, or create another, separate cluster. And of course, either way you will have to schedule downtime on your cluster to make the modifications.&nbsp;</p>



<h4 class="wp-block-heading">Cluster Efficiency&nbsp;</h4>



<p>And for the final challenge: scaling!</p>



<p>How can you get the most benefit from the cluster resources you are paying for? Are you paying for your cluster but feel you&#8217;re not using it efficiently? Is your cluster too big for your users? Is it running, but empty of jobs during the holiday seasons?</p>



<p>When you have a processing cluster &#8211; especially if you have a lot of precious resources in your cluster that you&#8217;re paying for &#8211; you will always have one major concern: is your cluster being utilised as efficiently as possible? There will be occasions when some resources in your cluster are idle, or when you are only running small jobs that don&#8217;t require all the resources in your cluster. Scaling will become a major pain point.</p>
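<p>To make the idle-capacity problem concrete, here is a back-of-envelope sketch comparing an always-on cluster with per-job resources. All prices, cluster sizes and job counts below are hypothetical placeholders chosen for the illustration, not OVHcloud rates.</p>

```python
# Back-of-envelope comparison: always-on cluster vs per-job resources.
# All figures are hypothetical, for illustration only.

HOURS_PER_MONTH = 730

def always_on_cost(cores: int, price_per_core_hour: float) -> float:
    """Cost of keeping a fixed-size cluster up 24/7 for a month."""
    return cores * price_per_core_hour * HOURS_PER_MONTH

def per_job_cost(jobs: list[tuple[int, float]], price_per_core_hour: float) -> float:
    """Cost when each job gets its own right-sized, short-lived cluster.
    `jobs` is a list of (cores, duration_hours) pairs."""
    return sum(cores * hours * price_per_core_hour for cores, hours in jobs)

price = 0.05  # hypothetical price per core-hour
# A 64-core cluster that actually runs ~40 jobs of 16 cores x 2 h per month:
static = always_on_cost(64, price)
on_demand = per_job_cost([(16, 2.0)] * 40, price)
print(f"always-on: {static:.2f}, per-job: {on_demand:.2f}")
```

<p>With these made-up numbers, the always-on cluster costs dozens of times more than paying only for the core-hours the jobs actually consume &#8211; which is exactly the gap an idle cluster represents.</p>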



<h3 class="wp-block-heading">OVHcloud Data Processing (ODP) Solution&nbsp;</h3>



<p>At OVHcloud, we created a new data service called OVHcloud Data Processing (ODP) to address all cluster management challenges mentioned above.&nbsp;</p>



<p>Let’s assume that you have some data to process but you don’t have the desire, the time, the budget or the skills to overcome these challenges. Maybe you don’t want to, or can’t, ask for help from colleagues or consultants to spawn and manage a cluster. How can you still make use of Apache Spark? This is where the ODP service comes in!</p>



<p>With ODP, you only need to write your Apache Spark code and ODP will do the rest. It will create a disposable, dedicated Apache Spark cluster in the cloud for each job in just a few seconds &#8211; then delete the whole cluster once the job is finished. You only pay for the requested resources, and only for the duration of the computation. There is no need to pay for hours and hours of cloud servers while you are busy with cluster installation, configuration, debugging or updating the engine version.&nbsp;</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/EF2587E4-9E30-4C3B-A6E1-0EAF3FF8F8F6-1024x813.jpeg" alt="OVHcloud Data Processing" class="wp-image-18856" width="512" height="407" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/EF2587E4-9E30-4C3B-A6E1-0EAF3FF8F8F6-1024x813.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/EF2587E4-9E30-4C3B-A6E1-0EAF3FF8F8F6-300x238.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/EF2587E4-9E30-4C3B-A6E1-0EAF3FF8F8F6-768x610.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/EF2587E4-9E30-4C3B-A6E1-0EAF3FF8F8F6.jpeg 1227w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<h4 class="wp-block-heading">ODP Cluster Creation</h4>



<p>When you submit your job, ODP will create an Apache Spark cluster dedicated to that job in just a few seconds. This cluster will have the amount of CPU and RAM, and the number of workers, specified in the job submission form. All necessary software will be installed automatically. You don’t need to worry about the cluster at all &#8211; how to install, configure or secure it. ODP does all of this for you.&nbsp;</p>



<h4 class="wp-block-heading">ODP Cluster Management&nbsp;</h4>



<p>When you submit your job, cluster management and monitoring are configured and handled by ODP. All logging and monitoring mechanisms and tools will be installed automatically for you. You will have a Grafana dashboard to monitor different parameters and resources of your job and you will have access to the official Apache Spark dashboard.&nbsp;</p>



<p>You don’t need to worry about cleaning the local disk of each node because each job will start with fresh resources. It isn&#8217;t possible, therefore, for one job to delay another job as each job has new, dedicated resources.&nbsp;&nbsp;</p>



<h4 class="wp-block-heading">ODP Cluster Security&nbsp;</h4>



<p>ODP will take care of the security and privacy of your cluster as well. Firstly, all communications between the Spark nodes are encrypted. Secondly, none of your job’s nodes are accessible from the outside. ODP only opens a limited set of ports for your cluster, so that you are still able to load or push your data.&nbsp;</p>



<h4 class="wp-block-heading">ODP Cluster Spark Version&nbsp;</h4>



<p>When it comes to using multiple Spark versions on the same cluster, ODP offers a solution. As every job possesses its own dedicated resources, each job can use any version currently supported by the service, independently from any other job running at the same time. When submitting an Apache Spark job through ODP, you first select the version of Apache Spark you would like to use. When the Apache Spark community releases a new version, it will soon become available in ODP, and you can then submit another job with the new Spark version as well. This means you no longer need to keep updating the Spark version of your whole cluster.&nbsp;</p>



<h4 class="wp-block-heading">ODP Cluster Efficiency</h4>



<p>Each time you submit a job, you’ll define exactly how many resources and workers you would like to use for that job. As said earlier, each job has its own dedicated resources, so you will be able to run small jobs alongside much bigger ones. This flexibility means that you will never have to worry about having an idle cluster. You pay for the resources you use, when you use them.</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/05/CD83F4F3-4858-414A-A50E-362477780385-1024x662.png" alt="OVHcloud Data Processing" class="wp-image-15537" width="768" height="497" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/05/CD83F4F3-4858-414A-A50E-362477780385-1024x662.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/CD83F4F3-4858-414A-A50E-362477780385-300x194.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/CD83F4F3-4858-414A-A50E-362477780385-768x496.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/CD83F4F3-4858-414A-A50E-362477780385-1200x775.png 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/CD83F4F3-4858-414A-A50E-362477780385.png 2048w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure></div>



<h3 class="wp-block-heading">How to start?&nbsp;</h3>



<p>If you&#8217;re interested in trying ODP, you can check out <a href="https://www.ovhcloud.com/en/public-cloud/data-processing/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://www.ovhcloud.com/en/public-cloud/data-processing/</a>, or you can easily create an account at <a href="http://www.ovhcloud.com" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">www.ovhcloud.com</a> and select “Data Processing” in the Public Cloud section. It is also possible to ask questions directly to the product team in the ODP public Gitter channel: <a href="https://gitter.im/ovh/data-processing" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://gitter.im/ovh/data-processing</a>.</p>



<h3 class="wp-block-heading">Conclusion&nbsp;</h3>



<p>With ODP, the challenges of running an Apache Spark cluster are removed, or at least alleviated (we still can’t do much about users’ expectations!). You don’t have to worry about lacking the resources necessary to process your data, or about the need to create, install and manage your own cluster. </p>



<p>Focus on your processing algorithm and ODP will do the rest.</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fwhy-are-you-still-managing-your-data-processing-clusters%2F&amp;action_name=Why%20are%20you%20still%20managing%20your%20data%20processing%20clusters%3F&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Improving the quality of data with Apache Spark</title>
		<link>https://blog.ovhcloud.com/improving-the-quality-of-data-with-apache-spark/</link>
		
		<dc:creator><![CDATA[Hubert Stefani]]></dc:creator>
		<pubDate>Tue, 15 Sep 2020 15:34:26 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Apache Spark]]></category>
		<category><![CDATA[Data Processing]]></category>
		<category><![CDATA[OVHcloud Data Processing]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=18676</guid>

					<description><![CDATA[Today we are proposing you a guest post by Hubert Stefani, Chief Innovation Officer and Cofounder of Novagen Conseil As data consultant experts and heavy Apache Spark users, we felt honoured to become early adopters of OVHcloudData Processing. As a first use case to test this offering, we chose our quality assessment process. As a [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fimproving-the-quality-of-data-with-apache-spark%2F&amp;action_name=Improving%20the%20quality%20of%20data%20with%20Apache%20Spark&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow">
<p><em>Today we bring you a guest post by</em> Hubert Stefani, Chief Innovation Officer and Cofounder of <a href="http://www.novagen.tech/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Novagen Conseil</a></p>
</div></div>



<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/09/IMG_0269-1024x537.png" alt="Improving the quality of data with Apache Spark" class="wp-image-19307" width="768" height="403" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0269-1024x537.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0269-300x157.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0269-768x403.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0269.png 1200w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure>



<p>As data consultant experts and heavy Apache Spark users, we felt honoured to become early adopters of <a href="https://www.ovhcloud.com/en-ie/public-cloud/data-processing/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud Data Processing</a>. As a first use case to test this offering, we chose our quality assessment process.</p>



<p>As a data consultancy company based in Paris, we build complete and innovative data strategies for our large corporate and public customers: top Fortune-ranked banks, public authorities, retailers, the fashion industry, transportation leaders, etc. We offer them huge-scale BI, data lake creation and management, and business innovation with data science. Within our Data Lab, we select best-in-class technology and create what we call ‘boosters’, i.e. ready-to-deploy or customized data assets.</p>



<p>When it comes to selecting a new technology solution, we have the following check list:</p>



<ul class="wp-block-list"><li><strong>Innovation and evolvability</strong>: depth of functionality, added value and usability</li><li><strong>Performance and cost-effectiveness</strong>: intrinsic performance, but also technical architectures that adapt to customer needs</li><li><strong>Open standards and governance</strong>: to support our customers’ cloud or multi-cloud strategies, we choose to rely on open standards to deploy on different targets and preserve reversibility.</li></ul>




<h3 class="wp-block-heading"> Apache Spark, our Swiss Army knife</h3>



<p>About a month ago, OVHcloud’s Data and AI Product Manager, Bastien Verdebout, approached us to test the company&#8217;s new product, OVHcloud Data Processing, built on top of Apache Spark. The answer was of course yes!</p>



<p>One of the reasons we felt so eager to discover this data-processing-as-a-service solution was our extensive usage of Apache Spark; it’s our Swiss Army knife for processing data.</p>



<ul class="wp-block-list"><li>It works at extremely high scales of data,</li><li>It meets the needs of data engineering and data science,</li><li>It allows the processing of both data at rest and data streams,</li><li>It’s the de facto standard for data workloads on-premises and in the cloud,</li><li>It offers built-in APIs for Python, Scala, Java and R.</li></ul>



<p>We have progressively developed software assets on top of Apache Spark to address recurring challenges such as:</p>



<ul class="wp-block-list"><li>ETL processing in data lake environments,</li><li>Quality KPIs on top of data lake sources,</li><li>Machine Learning algorithms for Natural Language Processing, Time Series predictions&#8230;</li></ul>



<h3 class="wp-block-heading">The ideal use case: data quality assessment</h3>



<p>We considered the following characteristics of <strong>OVHcloud Data Processing</strong>:</p>



<ul class="wp-block-list"><li>Processing engine built on top of <strong>Apache Spark 2.4.3</strong></li><li>Jobs start after <strong>a few seconds</strong> (vs minutes to launch a cluster)</li><li>Ability to <strong>adjust the power dedicated to different Spark jobs</strong>: from low power (1 driver and 1 executor with 4 cores and 8 GB of memory) to high-scale processing (potentially hundreds of cores and GB of memory)</li><li>A full <strong>compute/storage separation</strong> aligned with <strong>standard cloud architectures</strong>, including S3 APIs to access data stored in the Object Storage layer.</li><li>Job execution and monitoring through a <strong>Command Line Interface</strong> and <strong>API</strong>&nbsp;</li></ul>



<p>These characteristics led us to choose our quality assessment process as an ideal use case which requires both interactivity and adjustable compute resources to deliver quality KPIs through Spark processes.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="752" src="https://www.ovh.com/blog/wp-content/uploads/2020/09/IMG_0268-1024x752.png" alt="Why we need Spark as a Service" class="wp-image-19299" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0268-1024x752.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0268-300x220.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0268-768x564.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0268.png 1497w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h3 class="wp-block-heading"> OVHCloud Data Processing at work</h3>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="960" height="540" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/image-blog-post-Novagen-2.png" alt="" class="wp-image-18981" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/image-blog-post-Novagen-2.png 960w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/image-blog-post-Novagen-2-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/image-blog-post-Novagen-2-768x432.png 768w" sizes="auto, (max-width: 960px) 100vw, 960px" /></figure>



<p>The corresponding command generated by our software is:</p>



<pre class="wp-block-code"><code class="">./ovh-spark-submit --projectid ec7d2cb6da084055a0501b2d8d8d62a1 \
  --class tech.novagen.spark.Launcher --driver-cores 4 --driver-memory 8G \
  --executor-cores 4 --executor-memory 8G --num-executors 5 \
  swift://sparkjars/QualitySparkExecutor-1.0-spark.jar --apiServer=5.1.1.2:80</code></pre>



<p>The command is quite similar to a usual spark-submit, except for the jar path, which requires the binary to be in an Object Storage bucket that we access via the swift:// URL scheme. (NB: this command could also have been created with a call to the OVHcloud Data Processing API.)</p>



<p>Starting from this point, we can fine-tune our process portfolio and experiment with allocating different amounts of power, with few limitations (other than the quotas assigned to any Public Cloud project).</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="960" height="540" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/image-blog-post-Novagen-3.png" alt="" class="wp-image-18982" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/image-blog-post-Novagen-3.png 960w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/image-blog-post-Novagen-3-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/image-blog-post-Novagen-3-768x432.png 768w" sizes="auto, (max-width: 960px) 100vw, 960px" /></figure>



<h2 class="wp-block-heading"> Real-time display of job logs</h2>



<p>Finally, for tuning and post-mortem job analysis, we can take advantage of the saved log files. It is noteworthy that OVHcloud Data Processing offers a real-time display of job logs, which is very convenient, and provides complementary supervision through Grafana dashboards.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="960" height="540" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/image-blog-post-Novagen-4.png" alt="" class="wp-image-18983" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/image-blog-post-Novagen-4.png 960w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/image-blog-post-Novagen-4-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/image-blog-post-Novagen-4-768x432.png 768w" sizes="auto, (max-width: 960px) 100vw, 960px" /></figure>



<p>This was a first, yet significant, test of <a href="https://www.ovhcloud.com/en-ie/public-cloud/data-processing/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud Data Processing</a>; so far, it has proved an excellent match for the Novagen quality process use case and allowed us to validate several crucial points when testing a new data solution:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="960" height="540" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/image-blog-post-Novagen-5.png" alt="" class="wp-image-18984" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/image-blog-post-Novagen-5.png 960w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/image-blog-post-Novagen-5-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/image-blog-post-Novagen-5-768x432.png 768w" sizes="auto, (max-width: 960px) 100vw, 960px" /></figure>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"><p>This is the beginning of this product, and we will have a close look at the upcoming functionalities. The OVHCloud team unveiled part of its roadmap, and it looks really promising!</p><cite>Hubert Stefani, Chief Innovation Officer of Novagen Conseil</cite></blockquote>



<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fimproving-the-quality-of-data-with-apache-spark%2F&amp;action_name=Improving%20the%20quality%20of%20data%20with%20Apache%20Spark&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Do you need to process your data? Try the new OVHcloud Data Processing service!</title>
		<link>https://blog.ovhcloud.com/try-the-new-ovhcloud-data-processing-service/</link>
		
		<dc:creator><![CDATA[Mojtaba Imani]]></dc:creator>
		<pubDate>Wed, 22 Jul 2020 12:55:45 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Apache Spark]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Cluster Computing]]></category>
		<category><![CDATA[Data Platform]]></category>
		<category><![CDATA[Data Processing]]></category>
		<category><![CDATA[Distributed Computing]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=18748</guid>

					<description><![CDATA[One of the data services of OVHcloud is called OVHcloud Data Processing (ODP). It is a service that allows you to submit a processing job without caring about the cluster behind it. You just have to specify the ressources you want to use for your job, and the service will abstract the cluster creation, and destroy it for you as soon as your job is finished. In other words, you don’t have to think about clusters any more. Decide how much resources you need to process your data in the most efficient way for you and let OVHcloud Data Processing do the rest.<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Ftry-the-new-ovhcloud-data-processing-service%2F&amp;action_name=Do%20you%20need%20to%20process%20your%20data%3F%20Try%20the%20new%20OVHcloud%20Data%20Processing%20service%21&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p>Today, we are generating more data than ever: 90 percent of the world&#8217;s data has been generated in the last two years, and by 2025 the amount of data in the world is estimated to reach 175 zettabytes. People write 500 million tweets per day, and autonomous cars generate 20 TB of data every hour. By 2025, more than 75 billion IoT devices will be connected to the web, all generating data. Nowadays, devices and services that generate data are everywhere.</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/DAF95617-1C6B-49FC-A2CD-8346B1E6B9AC-1024x539.jpeg" alt="Do you need to process your data? Try the new OVHcloud Data Processing cloud service!" class="wp-image-18852" width="512" height="270" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/DAF95617-1C6B-49FC-A2CD-8346B1E6B9AC-1024x539.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/DAF95617-1C6B-49FC-A2CD-8346B1E6B9AC-300x158.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/DAF95617-1C6B-49FC-A2CD-8346B1E6B9AC-768x404.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/DAF95617-1C6B-49FC-A2CD-8346B1E6B9AC.jpeg 1200w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<p>There is also the notion of data exhaust which is the by-product of people&#8217;s online activities. It’s the data that’s generated as a result of someone visiting a website, buying a product or searching for something using a search engine. You may have heard this data described as Metadata.</p>



<p>We will start drowning in a flood of data unless we learn how to swim &#8211; how to benefit from vast amounts of data. To do this we need to be able to process the data for the sake of better decision-making, preventing fraud and danger,&nbsp; inventing better products or even predicting the future. The possibilities are endless.</p>



<p>But how can we process this huge amount of data? For sure, it’s not possible to do it the old fashioned way. We need to upgrade our methods and equipment.</p>



<p>Big data are data sets that have a volume, velocity and variety too large to be processed by a local computer. So what are the requirements for processing “big data”? </p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/7127B735-8372-4063-97DF-84BF3172E49A.jpeg" alt="Data, data everywhere!" class="wp-image-18850" width="436" height="368" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/7127B735-8372-4063-97DF-84BF3172E49A.jpeg 872w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/7127B735-8372-4063-97DF-84BF3172E49A-300x253.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/7127B735-8372-4063-97DF-84BF3172E49A-768x648.jpeg 768w" sizes="auto, (max-width: 436px) 100vw, 436px" /></figure></div>



<h3 class="wp-block-heading">1- Process data in parallel</h3>



<p>Data is everywhere and available in huge quantities. First off, let&#8217;s apply the old rule: &#8216;Divide and Conquer&#8217;.&nbsp;</p>



<p>Dividing the data means distributing data and processing tasks across several computers. These will need to be set up in a cluster to perform the different tasks in parallel, giving a reasonable boost in performance and speed.</p>



<p>Let&#8217;s assume you needed to find out what&#8217;s trending on Twitter. You would have to process around 500 million tweets with one computer in one hour. Not so easy now, is it? And how would you benefit if it took a month to process? What is the value of finding the trend of the day one month later?</p>



<p>Parallelization is more than a “nice to have” feature. It’s a requirement!</p>
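<p>As a rough illustration of this split-map-merge pattern &#8211; the same pattern Spark distributes across a whole cluster &#8211; here is a minimal single-machine sketch using Python threads as stand-in “workers”. The sample tweets and worker count are made up for the example.</p>

```python
# Divide and conquer, sketched on a single machine with threads.
# Spark applies the same split -> map -> merge pattern across a whole
# cluster; here each "worker" is just a thread counting its own chunk.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_words(chunk: list[str]) -> Counter:
    """Map step: each worker counts words in its own slice of the data."""
    c = Counter()
    for line in chunk:
        c.update(line.lower().split())
    return c

def trending(lines: list[str], workers: int = 4) -> list[tuple[str, int]]:
    """Split the data, count chunks in parallel, then merge the partial counts."""
    size = max(1, len(lines) // workers)
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_words, chunks):
            total.update(partial)   # reduce step: merge partial results
    return total.most_common(3)

tweets = ["spark is fast", "spark scales", "data is everywhere"]
print(trending(tweets))
```

<p>Swap the threads for machines and the lists for distributed datasets, and this is essentially what a cluster-computing framework does for you &#8211; at a scale of 500 million tweets instead of three.</p>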



<h3 class="wp-block-heading">2- Process data in the cloud</h3>



<p>The second step is to create and manage these clusters in an efficient way.&nbsp;</p>



<p>You have several choices here, like creating clusters with your own servers and managing them by yourself. But that’s time consuming and costs quite a lot as well. It also lacks some features you may wish to use, like flexibility. For these reasons, the cloud appears to be a better and better solution every day for a lot of companies.</p>



<p>The elasticity that cloud solutions provide helps companies be flexible and adapt infrastructure to their needs. With data processing, for example, we need to be able to scale our computing cluster up and down easily, to adapt the computing power to the volume of data we want to process according to our constraints (time, costs, etc.).&nbsp;</p>



<p>And then, even if you decide to use a cloud provider, you will have several solutions to choose from, each with their own drawbacks. One of these solutions is to create a computing cluster over dedicated servers or Public Cloud instances and send different processing jobs to the cluster. The main drawback of this solution is that when no processing is running, you are still paying for reserved but unused processing resources.&nbsp;</p>



<p>A more efficient way would therefore be to create a dedicated cluster for each processing job, with the right resources for that job, and then delete the cluster afterwards. Each new job would have its own cluster, sized as needed, spawned on demand. But this solution is only feasible if creating a computing cluster takes a few seconds, not minutes or hours.</p>
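<p>Sizing "the right resources for that job" can start as simple arithmetic. The per-worker throughput figure below is a hypothetical placeholder; real sizing depends on the job, the data format and the hardware.</p>

```python
import math

# Rough per-job sizing sketch: how many workers does a job need so that
# its dedicated, short-lived cluster finishes within a target time?
# The throughput figure is a hypothetical placeholder, for illustration.

def workers_needed(data_gb: float, gb_per_worker_hour: float,
                   target_hours: float) -> int:
    """Number of workers so that data_gb is processed within target_hours."""
    total_worker_hours = data_gb / gb_per_worker_hour
    return math.ceil(total_worker_hours / target_hours)

# e.g. 600 GB of logs, each worker chews ~50 GB/h, result wanted in 2 h:
print(workers_needed(600, 50, 2))  # 600/50 = 12 worker-hours -> 6 workers
```

<p>Run this estimate per job, request that many workers, and delete the cluster when the job ends &#8211; instead of provisioning one static cluster for the biggest job you might ever run.</p>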



<h4 class="wp-block-heading">Data locality</h4>



<p>When creating a processing cluster, it is also worth considering data locality. Cloud providers usually offer several regions, with data centres located in different countries. This has two main benefits: </p>



<p>The first one is not directly linked to data locality, but is more of a legal point. Depending on where your customers are and where your data is, you may need to comply with local data privacy laws and regulations. You may be required to keep your data in a specific region or country, and not be allowed to process it elsewhere. Being able to create a cluster of computers in that same region makes it easier to process data while complying with local privacy policies.</p>



<p>The second benefit is, of course, the potential to create your processing clusters in close physical proximity to your data. According to estimations, by the year 2025 almost 50 percent of the world&#8217;s data will be stored in the cloud. On-premises data storage is in decline. </p>



<p class="has-text-align-center"><img loading="lazy" decoding="async" src="https://lh5.googleusercontent.com/xLg8hqGdVDbuf4AbWODV95LN9DInueit-Lqv6xxjz61lGKJ4ukK7noRAcpiLKrm5To8-ztAstlFqfcq1qZqMerZwg1wYjMcxtQmXvqpwZzVemn9vz1lPowxb8h3CA7O0ug0C_GQa" width="624" height="360"></p>



<p>Therefore, using a cloud provider that has several regions gives companies the benefit of having processing clusters near their data&#8217;s physical location &#8211; this greatly reduces the time (and costs!) it takes to fetch data.</p>



<p>While processing your data in the cloud may not be a requirement per se, it&#8217;s certainly more beneficial than doing it yourself.</p>



<h3 class="wp-block-heading">3- Process data with the most appropriate distributed computing technology</h3>



<p>The third and final step is to decide how you are going to process your data, meaning with which tools. Again, you could do it yourself, by implementing a distributed processing engine in a language of your choice. But where&#8217;s the fun in that? (Okay, for some of us it might actually be quite fun!)</p>



<p>But it would be astronomically complex. You would need to write code to divide the data into several parts and send each part to a computer in your cluster. Each computer would then process its part of the data, and you would need to find a way to retrieve the results of each part and re-aggregate everything into a coherent result. In short, it would be a lot of work, with a lot of debugging.</p>
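<p>To make the divide-process-aggregate pattern concrete, here is a minimal hand-rolled sketch in Python using worker processes on one machine (a real distributed engine would ship the chunks to other computers, and add scheduling, shuffling and fault tolerance on top):</p>

```python
# A hand-rolled "divide, process, aggregate" pipeline: split the data into
# chunks, count hashtags in each chunk in a separate worker process, then
# merge the partial results. Frameworks like Spark automate exactly this.
from collections import Counter
from multiprocessing import Pool

def count_hashtags(chunk):
    """Process one partition: count hashtag occurrences in its tweets."""
    counts = Counter()
    for tweet in chunk:
        counts.update(word for word in tweet.split() if word.startswith("#"))
    return counts

def parallel_count(tweets, workers=4):
    # 1. Divide the data into roughly equal parts.
    chunks = [tweets[i::workers] for i in range(workers)]
    # 2. Send each part to a worker (here a local process, not another machine).
    with Pool(workers) as pool:
        partials = pool.map(count_hashtags, chunks)
    # 3. Re-aggregate the partial results into one coherent result.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total
```

<p>Even this toy version glosses over the hard parts &#8211; failed workers, skewed partitions, data that does not fit in memory &#8211; which is precisely what dedicated frameworks handle for you.</p>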



<h4 class="wp-block-heading">Apache Spark</h4>



<p>But there are technologies that have been developed specifically for this purpose. They distribute the data and processing tasks automatically and retrieve the results for you. Currently, the most popular distributed computing technology, especially in relation to Data Science subjects, is Apache Spark.</p>



<p>Apache Spark is an open-source, distributed, cluster-computing framework. It is much faster than its predecessor, Hadoop MapReduce, thanks to features like in-memory processing and lazy evaluation.</p>
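<p>Lazy evaluation means that transformations only build up a recipe; no data is touched until an action forces a result. Plain Python generators give a rough analogy (this is an illustration of the concept, not Spark itself):</p>

```python
# Lazy evaluation, illustrated with Python generators: each "transformation"
# extends a pipeline without doing any work; computation only happens when a
# terminal "action" (here, sum) pulls values through it.
from itertools import islice

data = range(1_000_000_000)                # no billion-element list is built
evens = (x for x in data if x % 2 == 0)    # transformation: still no work done
squares = (x * x for x in evens)           # another transformation, still lazy
result = sum(islice(squares, 5))           # action: only 5 values are computed
```

<p>Spark applies the same idea at cluster scale, which lets it optimise the whole pipeline before executing it.</p>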



<p>Apache Spark is the leading platform for large-scale SQL, batch processing, stream processing and machine learning. For coding in Apache Spark, you have the option of using different programming languages (including Java, Scala, Python, R and SQL). It can run locally on a single machine, or on a cluster of computers to distribute its tasks.</p>



<figure class="wp-block-image"><img decoding="async" src="https://lh5.googleusercontent.com/tHvP_t-ZFlfnCy00xVoj6hu9o_SPWa3b2Wl9O-CdRKXOw8ZF3fDqr9BnDppvAfgPu34wi_eRg4mRnBYbCcvdqK9SxHaV7vAh_stEbp4yuxIUpnujmf24gA3tADPdUNxjdUXSoyUs" alt=""/></figure>



<p>As you can see in the Google Trends chart above, there are alternatives. But Apache Spark has definitely established itself as the leader among distributed computing tools.</p>



<h3 class="wp-block-heading">OVHcloud Data Processing (ODP)</h3>



<p>OVHcloud is the leading European hosting and cloud provider, with a wide range of cloud services such as public and private cloud, managed Kubernetes and cloud storage. Besides all the hosting and cloud services, OVHcloud also provides a range of big data analytics and artificial intelligence services and platforms. </p>



<p>One of the data services offered by OVHcloud is OVHcloud Data Processing (ODP). It is a service that lets you submit a processing job without worrying about the cluster behind it. You just specify the resources you need for the job, and the service abstracts away cluster creation and destroys the cluster as soon as your job finishes. In other words, you don&#8217;t have to think about clusters any more. Decide how many resources you need to process your data efficiently, and let OVHcloud Data Processing do the rest.</p>



<h4 class="wp-block-heading">On-demand, job-specific Spark clusters</h4>



<p>The service will deploy a temporary, job-specific Apache Spark Cluster, then configure and secure it automatically. You don’t need to have any prior knowledge or skills related to the cloud, networking, cluster management systems, security, etc. You only have to focus on your processing algorithm and Apache Spark code.</p>



<p>This service will download your Apache Spark code from one of your Object Storage containers and ask how much RAM and how many CPU cores you would like your job to use. You will also have to specify the region where the processing should take place and, last but not least, the Apache Spark version you want to run your code with. The service will then launch your job within a few seconds, with the specified parameters, and run it until completion. Nothing else to do on your part: no cluster creation, no cluster destruction. Just focus on your code.</p>



<p>Your local computer&#8217;s resources no longer limit the amount of data you can process. You can run any number of processing jobs in parallel, in any region and with any version of Spark. It&#8217;s also very fast and very easy.</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/EF2587E4-9E30-4C3B-A6E1-0EAF3FF8F8F6-1024x813.jpeg" alt="OVHcloud Data Processing" class="wp-image-18856" width="512" height="407" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/EF2587E4-9E30-4C3B-A6E1-0EAF3FF8F8F6-1024x813.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/EF2587E4-9E30-4C3B-A6E1-0EAF3FF8F8F6-300x238.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/EF2587E4-9E30-4C3B-A6E1-0EAF3FF8F8F6-768x610.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/EF2587E4-9E30-4C3B-A6E1-0EAF3FF8F8F6.jpeg 1227w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<h3 class="wp-block-heading">How does it work?</h3>



<p>On your side, you just need to:</p>



<ol class="wp-block-list"><li>Create a container in OVHcloud Object Storage and upload your Apache Spark code and any other required files to this container. Be careful not to put your data in the same container, as the whole container will be downloaded by the service.</li><li>Define the processing engine (such as Apache Spark) and its version, as well as the geographical region and the amount of resources (CPU cores, RAM and number of worker nodes) you need. There are three different ways to do this: the OVHcloud Control Panel, the API or the ODP CLI.</li></ol>
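<p>Whichever interface you use, a job definition boils down to a handful of parameters. As a sketch, it might look something like the dictionary below &#8211; note that the field names here are illustrative assumptions for this post, not the exact API schema, which is documented on the OVHcloud API page:</p>

```python
# Illustrative ODP job definition. Field names are assumptions for the sketch;
# the authoritative schema lives in the OVHcloud API documentation.
job = {
    "engine": "spark",
    "engineVersion": "2.4.3",
    "region": "GRA",                      # where the processing takes place
    "containerName": "odp-code",          # Object Storage container with your code
    "name": "trending-hashtags",
    "engineParameters": [
        {"name": "main_application_code", "value": "job.py"},
        {"name": "driver_cores",    "value": "1"},
        {"name": "driver_memory",   "value": "4G"},
        {"name": "executor_cores",  "value": "2"},
        {"name": "executor_memory", "value": "4G"},
        {"name": "executor_num",    "value": "2"},   # number of worker nodes
    ],
}
```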



<p>These are the different steps that happen when you run a processing job on the OVHcloud Data Processing (ODP) platform:</p>



<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow">
<ol class="wp-block-list"><li>ODP takes over and handles the deployment and execution of your job according to the specifications you defined.</li><li>Before starting your job, ODP downloads all the files you uploaded to the specified container.</li><li>Next, ODP runs your job in a dedicated environment, created specifically for it. Apart from a limitation on the available ports (list <a href="https://docs.ovh.com/gb/en/data-processing/capabilities/" data-wpel-link="exclude">available here</a>), your job can connect to any data source (databases, object storage, etc.) to read or write data, as long as it is reachable through the Internet.</li><li>When the job is complete, ODP stores the execution output logs in your Object Storage and then immediately deletes the whole cluster.</li><li>You are charged for the amount of resources you specified, and only for the duration of your job&#8217;s computation, on a per-minute basis.</li></ol>
</div></div>



<h3 class="wp-block-heading">Different ways to submit a job</h3>



<p>There are three different ways you can submit a processing job to ODP, depending on your requirements: the OVHcloud Manager, the OVHcloud API and the CLI (Command Line Interface).</p>



<h4 class="wp-block-heading">1. OVHcloud Manager</h4>



<p>To submit a job with the OVHcloud Manager, go to OVHcloud.com and log in with your OVHcloud account (or create one if necessary). Then go to the &#8220;Public Cloud&#8221; page, select the &#8220;Data Processing&#8221; link on the left panel, and submit a job by clicking on &#8220;Start a new job&#8221;.</p>



<p>Before submitting a job, you need to create a container in OVHcloud Object Storage by clicking on the &#8220;Object Storage&#8221; link on the left panel, and upload your Apache Spark code and any other required files.</p>



<h4 class="wp-block-heading">2. OVHcloud API&nbsp;</h4>



<p>You can submit a job to ODP using the OVHcloud API. For more information, see the OVHcloud API web page at <a href="https://api.ovh.com/" data-wpel-link="exclude">https://api.ovh.com/</a>. You can also automate job submission using the ODP API.</p>



<h4 class="wp-block-heading">3. ODP CLI (Command Line Interface)&nbsp;</h4>



<p>ODP has an open-source Command Line Interface, which you can find on OVH&#8217;s public GitHub at <a href="https://github.com/ovh/data-processing-spark-submit" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://github.com/ovh/data-processing-spark-submit</a>. With the CLI, you can upload your files and code and create your Apache Spark cluster with a single command.</p>



<h3 class="wp-block-heading">Some ODP benefits</h3>



<p>You could always run your processing tasks on your local computer, or create an Apache Spark cluster on your own premises or with any cloud provider &#8211; managing that cluster yourself or using similar services from competitors. But ODP has several benefits that are worth keeping in mind when deciding on a solution:</p>



<ul class="wp-block-list"><li><strong>No cluster management or configuration</strong> skills or experience needed.</li><li><strong>No resource limits</strong>, easy and fast (the only limit is your cloud account quota).</li><li><strong>Pay-as-you-go</strong> model with simple pricing and no hidden costs (per-minute billing).</li><li><strong>Per-job resource definition</strong> (no more resources lost, compared to a mutualised cluster).</li><li><strong>Easy Apache Spark version management</strong> (you select the version for each job, and can even run different jobs with different versions of Apache Spark at the same time).</li><li><strong>Region selection</strong> (you can select different regions based on your data locality or data privacy policy).</li><li><strong>Start a Data Processing job in just a few seconds</strong>.</li><li><strong>Real-time logs</strong> (while your job is running, you receive real-time logs in your Customer Panel).</li><li><strong>Full output log available as soon as the job finishes</strong> (some competitors take minutes to deliver logs).</li><li><strong>Job submission automation</strong> (via the ODP API or CLI).</li><li><strong>Data privacy</strong> (OVHcloud is a European company and all customers are strictly protected by the European GDPR).</li></ul>



<h3 class="wp-block-heading">Conclusion</h3>



<p>With the advance of new technologies and devices, we are flooded with data. It is increasingly essential for businesses and for academic research to process data sets and understand where the value lies. By providing the OVHcloud Data Processing (ODP) service, our goal is to give you one of the easiest and most efficient platforms to process your data. Just focus on your processing algorithm, and ODP will handle the rest for you.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
