<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Apache Spark Archives - OVHcloud Blog</title>
	<atom:link href="https://blog.ovhcloud.com/tag/apache-spark/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.ovhcloud.com/tag/apache-spark/</link>
	<description>Innovation for Freedom</description>
	<lastBuildDate>Thu, 03 Jun 2021 08:27:48 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://blog.ovhcloud.com/wp-content/uploads/2019/07/cropped-cropped-nouveau-logo-ovh-rebranding-32x32.gif</url>
	<title>Apache Spark Archives - OVHcloud Blog</title>
	<link>https://blog.ovhcloud.com/tag/apache-spark/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Why are you still managing your data processing clusters?</title>
		<link>https://blog.ovhcloud.com/why-are-you-still-managing-your-data-processing-clusters/</link>
		
		<dc:creator><![CDATA[Mojtaba Imani]]></dc:creator>
		<pubDate>Wed, 30 Sep 2020 16:14:35 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Apache Spark]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Cluster Computing]]></category>
		<category><![CDATA[Data Platform]]></category>
		<category><![CDATA[Data Processing]]></category>
		<category><![CDATA[Distributed Computing]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=19241</guid>

					<description><![CDATA[Cluster computing is used to share a computation load among a group of computers, achieving a higher level of performance and scalability. Apache Spark is an open-source, distributed cluster-computing framework that is much faster than its predecessor, Hadoop MapReduce, thanks to features like in-memory processing and lazy evaluation. Apache Spark [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>Cluster computing is used to share a computation load among a group of computers, achieving a higher level of performance and scalability.</p>



<p>Apache Spark is an open-source, distributed cluster-computing framework that is much faster than its predecessor, Hadoop MapReduce, thanks to features like in-memory processing and lazy evaluation. It is the most popular tool in this category.</p>



<figure class="wp-block-image size-large"><img fetchpriority="high" decoding="async" width="1024" height="539" src="https://www.ovh.com/blog/wp-content/uploads/2020/09/IMG_0282-1024x539.png" alt="Why are you still managing your data processing clusters?" class="wp-image-19364" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0282-1024x539.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0282-300x158.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0282-768x404.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0282.png 1200w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p>The analytics engine is the leading framework for large-scale SQL, batch processing, stream processing and machine learning. You can write Spark code in several programming languages, including Java, Scala, Python, R and SQL. Spark can run locally on a single machine, or on a cluster of computers that distributes the tasks.</p>



<p>With Apache Spark, you can process your data on your local computer, or create a cluster and submit any number of processing jobs to it.</p>
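<p>As an illustration, here is a minimal PySpark job. The code is a generic sketch (the data and application name are made up): it runs locally, and pointing it at a cluster is just a matter of changing the master URL.</p>

<pre class="wp-block-code"><code class="">from pyspark.sql import SparkSession

# "local[*]" runs Spark on all cores of the local machine; replace it with
# your cluster's master URL (e.g. spark://host:7077) to distribute the work.
spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "label"])
counts = df.groupBy("label").count().collect()
print(sorted((r["label"], r["count"]) for r in counts))  # [('a', 2), ('b', 1)]

spark.stop()</code></pre>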



<p>You can create your cluster with physical computers on-premises, with virtual machines at a hosting company, or with any cloud provider. With your own cluster, you’ll be able to submit Spark jobs whenever you like.</p>



<h3 class="wp-block-heading">Cluster Management Challenges&nbsp;</h3>



<p>If you are processing a huge amount of data and expect results in a reasonable time, your local computer won’t be enough. You need a cluster of computers to divide the data and the processing workload &#8211; several computers running in parallel to speed up the task.</p>



<p>Creating and managing your own cluster of computers, however, is not an easy task. You will face several challenges:&nbsp;</p>



<h4 class="wp-block-heading">Cluster Creation&nbsp;</h4>



<p>Creating an Apache Spark cluster is an arduous task.&nbsp;</p>



<p>First, you’ll need to create a cluster of computers and install an operating system, development tools (Python, Java, Scala), etc.&nbsp;</p>



<p>Second, you’ll need to select a version of Apache Spark and install the necessary nodes (master&nbsp;and workers).&nbsp;</p>



<p>Lastly, you’ll need to connect all these nodes together to finalize your&nbsp;Apache Spark cluster. </p>



<p>All in all, it can take several hours to create and configure a new Apache Spark cluster.</p>



<h4 class="wp-block-heading">Cluster Management&nbsp;</h4>



<p>But once you have your own cluster up and running, your job is far from over. Is your cluster working well? Is each and every node healthy?&nbsp;</p>



<p>Here is the second challenge: facing the pain of cluster management!</p>



<p>You’ll need to check the health of all your nodes manually or, preferably, install monitoring tools that report any issues nodes may encounter. </p>



<p>Do the nodes have enough disk space available for new tasks? One key issue faced by Apache Spark clusters is that some tasks write a lot of data to the local disk space of nodes without ever deleting it. Disk space is a common problem and, as you may know, a lack of disk space makes it impossible to run more tasks.</p>



<p>Do you need to run multiple Spark jobs at the same time? Sometimes a single job occupies all the CPU and RAM resources in your cluster and prevents other jobs from starting and running alongside it.</p>
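<p>If you manage your own standalone cluster, one common mitigation is to cap the resources each job may claim, so that concurrent jobs can coexist. A hypothetical PySpark sketch (the master URL and values are placeholders, not a recommendation):</p>

<pre class="wp-block-code"><code class="">from pyspark.sql import SparkSession

# Hypothetical settings for a standalone cluster: cap this job at 8 cores
# in total, leaving CPU and RAM available for other concurrent jobs.
spark = (
    SparkSession.builder
    .master("spark://master-host:7077")   # placeholder master URL
    .appName("capped-job")
    .config("spark.cores.max", "8")       # total cores this job may claim
    .config("spark.executor.cores", "2")  # cores per executor
    .config("spark.executor.memory", "4g")
    .getOrCreate()
)</code></pre>

<p>This only limits the damage, of course; it does not remove the need to monitor and arbitrate between jobs yourself.</p>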



<p>These are only a few of the problems you will meet while working with your own clusters.</p>



<h4 class="wp-block-heading">Cluster Security</h4>



<p>Now for the third challenge! What is even more important than having a cluster up and running smoothly?</p>



<p>You guessed it: security. After all, Apache Spark is a Data Processing tool. And data is very sensitive.</p>



<p>Where in your cluster does security matter most? </p>



<p>What about the connection between nodes? Are they connected with a secured (and fast) connection? Who has access to the servers housing your cluster?&nbsp;</p>



<p>If you have created your cluster on the cloud and you are working with sensitive data, you’ll need to address these issues by securing each and every node and encrypting communications between them.&nbsp;</p>



<h4 class="wp-block-heading">Spark Version</h4>



<p>Here is your fourth challenge: managing your cluster users&#8217; expectations. In some cases this may be a less daunting task, but not always. </p>



<p>There isn&#8217;t a whole lot you can do to change the expectations of your cluster&#8217;s users, but here&#8217;s a common example to help you prepare:</p>



<p>Do your users like to test their code with different versions of Apache Spark? Or do they require the latest features from the latest Spark nightly builds?</p>



<p>When you create an Apache Spark cluster, you have to select one version of Spark. Your whole cluster will be bound to it, and <em>it</em> alone. This means it isn&#8217;t possible for several versions to coexist in the same cluster.</p>



<p>So you’ll either have to change the Spark version of your whole cluster, or create another, separate cluster. And of course, if you decide to do that, you’ll have to schedule downtime on your cluster to make the modifications.</p>



<h4 class="wp-block-heading">Cluster Efficiency&nbsp;</h4>



<p>And for the final challenge: scaling!</p>



<p>How can you get the most benefit from the cluster resources you are paying for? Are you paying for your cluster but feel you&#8217;re not using it efficiently? Is your cluster too big for your users? Is it running, but empty of jobs during the holiday seasons?</p>



<p>When you have a processing cluster &#8211; especially one full of precious resources that you&#8217;re paying for &#8211; you will always have one major concern: is your cluster being utilised as efficiently as possible? There will be occasions when some resources in your cluster are idle, or when you are only running small jobs that don&#8217;t require all the resources in your cluster. Scaling will become a major pain point.</p>



<h3 class="wp-block-heading">OVHcloud Data Processing (ODP) Solution&nbsp;</h3>



<p>At OVHcloud, we created a new data service called OVHcloud Data Processing (ODP) to address all cluster management challenges mentioned above.&nbsp;</p>



<p>Let’s assume that you have some data to process but you don’t have the desire, the time, the budget or the skills to overcome these challenges. Maybe you don’t want to, or can’t, ask for help from colleagues or consultants to spawn and manage a cluster. How can you still make use of Apache Spark? This is where the ODP service comes in!</p>



<p>With ODP, you only need to write your Apache Spark code, and ODP will do the rest. It will create a disposable, dedicated Apache Spark cluster in the cloud for each job in just a few seconds &#8211; then delete the whole cluster once the job is finished. You only pay for the requested resources, and only for the duration of the computation. There is no need to pay for hours and hours of cloud servers while you are busy with cluster installation, configuration, or even debugging and updating the engine version.</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/EF2587E4-9E30-4C3B-A6E1-0EAF3FF8F8F6-1024x813.jpeg" alt="OVHcloud Data Processing" class="wp-image-18856" width="512" height="407" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/EF2587E4-9E30-4C3B-A6E1-0EAF3FF8F8F6-1024x813.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/EF2587E4-9E30-4C3B-A6E1-0EAF3FF8F8F6-300x238.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/EF2587E4-9E30-4C3B-A6E1-0EAF3FF8F8F6-768x610.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/EF2587E4-9E30-4C3B-A6E1-0EAF3FF8F8F6.jpeg 1227w" sizes="(max-width: 512px) 100vw, 512px" /></figure></div>



<h4 class="wp-block-heading">ODP Cluster Creation</h4>



<p>When you submit your job, ODP will create an Apache Spark cluster dedicated to that job in just a few seconds. This cluster will have the amount of CPU and RAM, and the number of workers, specified in the job submission form. All the necessary software will be installed automatically. You don’t need to worry about the cluster at all: how to install, configure or secure it. ODP does all of this for you.</p>



<h4 class="wp-block-heading">ODP Cluster Management&nbsp;</h4>



<p>When you submit your job, cluster management and monitoring are configured and handled by ODP. All logging and monitoring mechanisms and tools will be installed automatically for you. You will have a Grafana dashboard to monitor different parameters and resources of your job and you will have access to the official Apache Spark dashboard.&nbsp;</p>



<p>You don’t need to worry about cleaning the local disk of each node, because each job starts with fresh resources. It therefore isn&#8217;t possible for one job to delay another, as each job has new, dedicated resources.</p>



<h4 class="wp-block-heading">ODP Cluster Security&nbsp;</h4>



<p>ODP will take care of the security and privacy of your cluster as well. Firstly, all communications between the Spark nodes are encrypted. Secondly, none of your job’s nodes are accessible from the outside: ODP only opens a limited set of ports for your cluster, so that you are still able to load or push your data.</p>



<h4 class="wp-block-heading">ODP Cluster Spark Version&nbsp;</h4>



<p>When it comes to using multiple Spark versions on the same cluster, ODP offers a solution. As every job possesses its own dedicated resources, each job can use any version currently supported by the service, independently of any other job running at the same time. When submitting an Apache Spark job through ODP, you first select the version of Apache Spark you would like to use. When the Apache Spark community releases a new version, it will soon become available in ODP, and you can then submit another job with the new Spark version as well. This means you no longer need to keep updating the Spark version of your whole cluster.</p>



<h4 class="wp-block-heading">ODP Cluster Efficiency</h4>



<p>Each time you submit a job, you’ll define exactly how many resources and workers you would like to use for that job. As said earlier, each job has its own dedicated resources, so you can have small jobs running alongside much bigger ones. This flexibility means that you never have to worry about an idle cluster: you pay for the resources you use, when you use them.</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/05/CD83F4F3-4858-414A-A50E-362477780385-1024x662.png" alt="OVHcloud Data Processing" class="wp-image-15537" width="768" height="497" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/05/CD83F4F3-4858-414A-A50E-362477780385-1024x662.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/CD83F4F3-4858-414A-A50E-362477780385-300x194.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/CD83F4F3-4858-414A-A50E-362477780385-768x496.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/CD83F4F3-4858-414A-A50E-362477780385-1200x775.png 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/CD83F4F3-4858-414A-A50E-362477780385.png 2048w" sizes="(max-width: 768px) 100vw, 768px" /></figure></div>



<h3 class="wp-block-heading">How to start?&nbsp;</h3>



<p>If you&#8217;re interested in trying ODP, you can check out <a href="https://www.ovhcloud.com/en/public-cloud/data-processing/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://www.ovhcloud.com/en/public-cloud/data-processing/</a>, or you can easily create an account at <a href="http://www.ovhcloud.com" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">www.ovhcloud.com</a> and select “Data Processing” in the Public Cloud section. You can also ask the product team questions directly in the public ODP Gitter channel: <a href="https://gitter.im/ovh/data-processing" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://gitter.im/ovh/data-processing</a>.</p>



<h3 class="wp-block-heading">Conclusion&nbsp;</h3>



<p>With ODP, the challenges of running an Apache Spark cluster are removed or alleviated (we still can&#8217;t do much about users&#8217; expectations!). You don&#8217;t have to worry about lacking the resources necessary to process your data, or about creating, installing and managing your own cluster. </p>



<p>Focus on your processing algorithm and ODP will do the rest.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Improving the quality of data with Apache Spark</title>
		<link>https://blog.ovhcloud.com/improving-the-quality-of-data-with-apache-spark/</link>
		
		<dc:creator><![CDATA[Hubert Stefani]]></dc:creator>
		<pubDate>Tue, 15 Sep 2020 15:34:26 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Apache Spark]]></category>
		<category><![CDATA[Data Processing]]></category>
		<category><![CDATA[OVHcloud Data Processing]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=18676</guid>

					<description><![CDATA[Today we bring you a guest post by Hubert Stefani, Chief Innovation Officer and Co-founder of Novagen Conseil. As data consultant experts and heavy Apache Spark users, we felt honoured to become early adopters of OVHcloud Data Processing. As a first use case to test this offering, we chose our quality assessment process. As a [&#8230;]]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow">
<p><em>Today we bring you a guest post by</em> Hubert Stefani, Chief Innovation Officer and Co-founder of <a href="http://www.novagen.tech/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Novagen Conseil</a>.</p>
</div></div>



<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/09/IMG_0269-1024x537.png" alt="Improving the quality of data with Apache Spark" class="wp-image-19307" width="768" height="403" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0269-1024x537.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0269-300x157.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0269-768x403.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0269.png 1200w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure>



<p>As data consultant experts and heavy Apache Spark users, we felt honoured to become early adopters of <a href="https://www.ovhcloud.com/en-ie/public-cloud/data-processing/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud Data Processing</a>. As a first use case to test this offering, we chose our quality assessment process.</p>



<p>As a data consultancy company based in Paris, we build complete and innovative data strategies for our large corporate and public customers: the top Fortune banks, public authorities, retailers, the fashion industry, transportation leaders, etc. We offer them huge-scale BI, data lake creation and management, and business innovation with data science. Within our Data Lab, we select best-in-class technology and create what we call ‘boosters’, i.e. ready-to-deploy or customised data assets.</p>



<p>When it comes to selecting a new technology solution, we have the following check list:</p>



<ul class="wp-block-list"><li><strong>Innovation and evolutivity</strong>: depth of functionality, additional value and usability</li><li><strong>Performance and cost-effectiveness</strong>: intrinsic performance, but also technical architectures that adapt to customer needs</li><li><strong>Open standards and governance</strong>: to support our customers’ cloud or multi-cloud strategies, we rely on open standards to deploy on different targets and preserve reversibility.</li></ul>




<h3 class="wp-block-heading">Apache Spark, our Swiss Army knife</h3>



<p>About a month ago, OVHcloud’s Data and AI Product Manager, Bastien Verdebout, approached us to test their new product, OVHcloud Data Processing, built on top of Apache Spark. The answer was of course yes!</p>



<p>One of the reasons we felt so eager to discover this data-processing-as-a-service solution was that we use Apache Spark extensively; it’s our Swiss Army knife for processing data.</p>



<ul class="wp-block-list"><li>It works at extremely large data scales,</li><li>It meets the needs of data engineering and data science,</li><li>It allows the processing of data at rest and data streaming,</li><li>It’s the de facto standard for data workloads on-premises and in the cloud,</li><li>It offers built-in APIs for Python, Scala, Java and R.</li></ul>



<p>We have progressively developed software assets on top of Apache Spark to address recurring challenges such as:</p>



<ul class="wp-block-list"><li>ETL processing in data lake environments,</li><li>Quality KPIs on top of data lake sources,</li><li>Machine Learning algorithms for Natural Language Processing, Time Series prediction&#8230;</li></ul>



<h3 class="wp-block-heading">The ideal use case: data quality assessment</h3>



<p>We considered the following characteristics of <strong>OVHcloud Data Processing</strong>:</p>



<ul class="wp-block-list"><li>Processing engine built on top of <strong>Apache Spark 2.4.3</strong></li><li>Jobs start after <strong>a few seconds</strong> (vs minutes to launch a cluster)</li><li>Ability to <strong>adjust the power dedicated to different Spark jobs</strong>: from low power (1 driver and 1 executor with 4 cores and 8 GB of memory) to high-scale processing (potentially hundreds of cores and gigabytes of memory)</li><li>Full <strong>compute/storage separation</strong> aligned with the <strong>standards of cloud architecture</strong>, including S3 APIs to access data stored in the Object Storage layer</li><li>Job execution and monitoring through a <strong>Command Line Interface</strong> and <strong>API</strong></li></ul>



<p>These characteristics led us to choose our quality assessment process as the ideal use case: it requires both interactivity and adjustable compute resources to deliver quality KPIs through Spark processes.</p>
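<p>The quality KPIs themselves are Novagen&#8217;s own software, but the underlying idea can be sketched with plain PySpark. In this hypothetical example (column names and data are made up; the real job would read from Object Storage), we compute a completeness KPI: the share of non-null values per column.</p>

<pre class="wp-block-code"><code class="">from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("quality-kpis").getOrCreate()

# Made-up sample; a real job would load the data from Object Storage.
df = spark.createDataFrame(
    [("alice", 34), ("bob", None), (None, 21), ("dan", 45)],
    ["name", "age"],
)

total = df.count()
# Completeness KPI: share of non-null values per column.
kpis = {c: df.filter(F.col(c).isNotNull()).count() / total for c in df.columns}
print(kpis)  # {'name': 0.75, 'age': 0.75}

spark.stop()</code></pre>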



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="752" src="https://www.ovh.com/blog/wp-content/uploads/2020/09/IMG_0268-1024x752.png" alt="Why we need Spark as a Service" class="wp-image-19299" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0268-1024x752.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0268-300x220.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0268-768x564.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0268.png 1497w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<h3 class="wp-block-heading">OVHcloud Data Processing at work</h3>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="960" height="540" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/image-blog-post-Novagen-2.png" alt="" class="wp-image-18981" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/image-blog-post-Novagen-2.png 960w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/image-blog-post-Novagen-2-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/image-blog-post-Novagen-2-768x432.png 768w" sizes="auto, (max-width: 960px) 100vw, 960px" /></figure>



<p>The corresponding command generated by our software is:</p>



<pre class="wp-block-code"><code class="">./ovh-spark-submit --projectid ec7d2cb6da084055a0501b2d8d8d62a1 \
  --class tech.novagen.spark.Launcher --driver-cores 4 --driver-memory 8G \
  --executor-cores 4 --executor-memory 8G --num-executors 5 \
  swift://sparkjars/QualitySparkExecutor-1.0-spark.jar --apiServer=5.1.1.2:80</code></pre>



<p>The command is quite similar to a usual spark-submit, except for the jar path: the binary must be in an Object Storage bucket, which we access with the swift:// URL scheme. (NB: this command could also have been generated with a call to the OVHcloud Data Processing API.)</p>



<p>Starting from this point, we can fine-tune our process portfolio and experiment with different resource allocations with few limitations (apart from the quotas assigned to any Public Cloud project).</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="960" height="540" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/image-blog-post-Novagen-3.png" alt="" class="wp-image-18982" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/image-blog-post-Novagen-3.png 960w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/image-blog-post-Novagen-3-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/image-blog-post-Novagen-3-768x432.png 768w" sizes="auto, (max-width: 960px) 100vw, 960px" /></figure>



<h3 class="wp-block-heading">Real-time display of job logs</h3>



<p>Finally, for tuning and post-mortem job analysis, we can take advantage of the saved log files. It is noteworthy that OVHcloud Data Processing offers a real-time display of job logs, which is very convenient, and provides complementary supervision through Grafana dashboards.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="960" height="540" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/image-blog-post-Novagen-4.png" alt="" class="wp-image-18983" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/image-blog-post-Novagen-4.png 960w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/image-blog-post-Novagen-4-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/image-blog-post-Novagen-4-768x432.png 768w" sizes="auto, (max-width: 960px) 100vw, 960px" /></figure>



<p>This was a first yet significant test of <a href="https://www.ovhcloud.com/en-ie/public-cloud/data-processing/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud Data Processing</a>; so far, it has proved an excellent match for the Novagen quality process use case and allowed us to validate several crucial points when testing a new data solution:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="960" height="540" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/image-blog-post-Novagen-5.png" alt="" class="wp-image-18984" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/image-blog-post-Novagen-5.png 960w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/image-blog-post-Novagen-5-300x169.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/image-blog-post-Novagen-5-768x432.png 768w" sizes="auto, (max-width: 960px) 100vw, 960px" /></figure>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"><p>This is the beginning of this product, and we will have a close look at the upcoming functionalities. The OVHCloud team unveiled part of its roadmap, and it looks really promising!</p><cite>Hubert Stefani, Chief Innovation Officer of Novagen Conseil</cite></blockquote>



]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Do you need to process your data? Try the new OVHcloud Data Processing service!</title>
		<link>https://blog.ovhcloud.com/try-the-new-ovhcloud-data-processing-service/</link>
		
		<dc:creator><![CDATA[Mojtaba Imani]]></dc:creator>
		<pubDate>Wed, 22 Jul 2020 12:55:45 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Apache Spark]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Cluster Computing]]></category>
		<category><![CDATA[Data Platform]]></category>
		<category><![CDATA[Data Processing]]></category>
		<category><![CDATA[Distributed Computing]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=18748</guid>

					<description><![CDATA[One of the data services of OVHcloud is called OVHcloud Data Processing (ODP). It is a service that allows you to submit a processing job without caring about the cluster behind it. You just have to specify the resources you want to use for your job, and the service will handle the cluster creation, then destroy the cluster for you as soon as your job is finished. In other words, you don’t have to think about clusters any more. Decide how many resources you need to process your data in the most efficient way for you, and let OVHcloud Data Processing do the rest.]]></description>
										<content:encoded><![CDATA[
<p>Today, we are generating more data than ever: 90 percent of the world’s data has been generated in the last two years, and by 2025 the amount of data in the world is estimated to reach 175 zettabytes. People write 500 million tweets per day, autonomous cars generate 20 TB of data every hour, and by 2025 more than 75 billion IoT devices are expected to be connected to the web, all generating data. Nowadays, devices and services that generate data are everywhere.</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/DAF95617-1C6B-49FC-A2CD-8346B1E6B9AC-1024x539.jpeg" alt="Do you need to process your data? Try the new OVHcloud Data Processing cloud service!" class="wp-image-18852" width="512" height="270" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/DAF95617-1C6B-49FC-A2CD-8346B1E6B9AC-1024x539.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/DAF95617-1C6B-49FC-A2CD-8346B1E6B9AC-300x158.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/DAF95617-1C6B-49FC-A2CD-8346B1E6B9AC-768x404.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/DAF95617-1C6B-49FC-A2CD-8346B1E6B9AC.jpeg 1200w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<p>There is also the notion of data exhaust, which is the by-product of people&#8217;s online activities. It&#8217;s the data that&#8217;s generated as a result of someone visiting a website, buying a product or searching for something using a search engine. You may have heard this data described as metadata.</p>



<p>We will start drowning in a flood of data unless we learn how to swim &#8211; how to benefit from vast amounts of data. To do this, we need to be able to process the data for the sake of better decision-making, preventing fraud and danger, inventing better products, or even predicting the future. The possibilities are endless.</p>



<p>But how can we process this huge amount of data? It&#8217;s certainly not possible to do it the old-fashioned way. We need to upgrade our methods and equipment.</p>



<p>Big data refers to data sets whose volume, velocity and variety are too great to be processed by a single local computer. So what are the requirements for processing “big data”? </p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/7127B735-8372-4063-97DF-84BF3172E49A.jpeg" alt="Data, data everywhere!" class="wp-image-18850" width="436" height="368" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/7127B735-8372-4063-97DF-84BF3172E49A.jpeg 872w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/7127B735-8372-4063-97DF-84BF3172E49A-300x253.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/7127B735-8372-4063-97DF-84BF3172E49A-768x648.jpeg 768w" sizes="auto, (max-width: 436px) 100vw, 436px" /></figure></div>



<h3 class="wp-block-heading">1- Process data in parallel</h3>



<p>Data is everywhere and available in huge quantities. First off, let&#8217;s apply the old rule: &#8216;Divide and Conquer&#8217;.</p>



<p>Dividing the data means distributing both the data and the processing tasks across several computers. These computers need to be organised into a cluster, so they can perform the different tasks in parallel and deliver reasonable performance and speed.</p>
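<p>The divide-and-conquer idea can be sketched in plain Python, outside of any distributed framework: split the data into chunks, count words in each chunk in a separate process, then merge the partial results. It is only a toy stand-in for a real cluster, but the shape of the computation is the same:</p>

```python
from multiprocessing import Pool
from collections import Counter

def count_words(chunk):
    """Process one partition of the data independently."""
    c = Counter()
    for line in chunk:
        c.update(line.split())
    return c

def parallel_word_count(lines, workers=4):
    # Divide: split the lines into one chunk per worker
    chunks = [lines[i::workers] for i in range(workers)]
    # Conquer: process each chunk in a separate process
    with Pool(workers) as pool:
        partials = pool.map(count_words, chunks)
    # Merge the partial results into a single count
    return sum(partials, Counter())

if __name__ == "__main__":
    data = ["spark makes big data simple", "big data needs big clusters"]
    print(parallel_word_count(data, workers=2))
```

<p>A real engine like Spark does exactly this, but across machines, with fault tolerance and automatic result aggregation.</p>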



<p>Let&#8217;s assume you needed to find out what&#8217;s trending on Twitter. You would have to process around 500 million tweets a day &#8211; not something a single computer can manage in a reasonable time. Not so easy now, is it? And how would you benefit if it took a month to process? What is the value of finding the trend of the day, one month later?</p>



<p>Parallelization is more than a “nice to have” feature. It’s a requirement!</p>



<h3 class="wp-block-heading">2- Process data in the cloud</h3>



<p>The second step is to create and manage these clusters in an efficient way.&nbsp;</p>



<p>You have several choices here, like creating clusters with your own servers and managing them by yourself. But that&#8217;s time-consuming and quite costly. It also lacks features you may want, such as flexibility. For these reasons, the cloud looks like a better and better solution every day for a lot of companies.</p>



<p>The elasticity that cloud solutions provide helps companies to stay flexible and adapt their infrastructure to their needs. With data processing, for example, we need to be able to scale our computing cluster up and down easily, adapting the computing power to the volume of data we want to process according to our constraints (time, costs, etc.).</p>



<p>And then, even if you decide to use a cloud provider, you will have several solutions to choose from, each with their own drawbacks. One of these solutions is to create a computing cluster over dedicated servers or public cloud instances and send different processing jobs to it. The main drawback of this solution is that when no processing is being done, you are still paying for reserved but unused processing resources.</p>



<p>A more efficient way would therefore be to create a dedicated cluster for each processing job, with the right resources for that job, and then delete the cluster afterwards. Each new job would have its own cluster, sized as needed, spawned on demand. But this solution is only feasible if creating a computing cluster takes a few seconds rather than minutes or hours.</p>



<h4 class="wp-block-heading">Data locality</h4>



<p>When creating a processing cluster, it is also worth considering data locality. Cloud providers usually offer several regions, spread across data centers situated in different countries. This has two main benefits: </p>



<p>The first one is not directly linked to data locality but is more of a legal point. Depending on where your customers and your data are, you may need to comply with local data privacy laws and regulations, keeping your data in a specific region or country and not processing it outside. Creating a cluster of computers in that region makes it easier to process data while complying with local privacy policies.</p>



<p>The second benefit is, of course, the potential to create your processing clusters in close physical proximity to your data. According to estimates, by 2025 almost 50 percent of the world&#8217;s data will be stored in the cloud, while on-premises data storage is in decline. </p>



<p class="has-text-align-center"><img loading="lazy" decoding="async" src="https://lh5.googleusercontent.com/xLg8hqGdVDbuf4AbWODV95LN9DInueit-Lqv6xxjz61lGKJ4ukK7noRAcpiLKrm5To8-ztAstlFqfcq1qZqMerZwg1wYjMcxtQmXvqpwZzVemn9vz1lPowxb8h3CA7O0ug0C_GQa" width="624" height="360"></p>



<p>Therefore, using cloud providers that have several regions gives companies the benefit of having processing clusters near their data&#8217;s physical location &#8211; this greatly reduces the time (and costs!) it takes to fetch data.</p>



<p>While processing your data in the cloud may not be a requirement per se, it&#8217;s certainly more beneficial than doing it yourself.</p>



<h3 class="wp-block-heading">3- Process data with the most appropriate distributed computing technology</h3>



<p>The third and final step is to decide how you are going to process your data &#8211; that is, with which tools. Again, you could do it yourself, by implementing a distributed processing engine in a language of your choice. But where&#8217;s the fun in that? (okay, for some of us it might actually be quite fun!)</p>



<p>But it would be astronomically complex. You would need to write code to divide the data into several parts and send each part to a computer in your cluster. Each computer would then process its part of data, and you would need to find a way to retrieve the results of each part and to re-aggregate everything into a coherent result. In short, it would be a lot of work, with a lot of debugging.&nbsp;</p>



<h4 class="wp-block-heading">Apache Spark</h4>



<p>But there are technologies that have been developed specifically for this purpose. They distribute the data and processing tasks automatically and retrieve the results for you. Currently, the most popular distributed computing technology, especially in relation to Data Science subjects, is Apache Spark.</p>



<p>Apache Spark is an open-source, distributed, cluster-computing framework. It is much faster than its predecessor, Hadoop MapReduce, thanks to features like in-memory processing and lazy evaluation.</p>



<p>Apache Spark is the leading platform for large-scale SQL, batch processing, stream processing and machine learning. For coding in Apache Spark, you have the option of using different programming languages (including Java, Scala, Python, R and SQL). It can run locally on a single machine or in a cluster of computers to distribute its tasks.</p>
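<p>Lazy evaluation, one of the features behind Spark&#8217;s speed, is easy to picture with plain Python generators (a rough analogy, not Spark itself): the &#8220;transformations&#8221; only describe the work, and nothing is actually computed until an &#8220;action&#8221; consumes the result:</p>

```python
def transformations(lines):
    # Like Spark's map/filter, these build a lazy pipeline;
    # no line is read or processed yet.
    cleaned = (line.strip().lower() for line in lines)
    matches = (1 for line in cleaned if "spark" in line)
    return matches

lines = ["Apache Spark ", "Hadoop MapReduce", "spark streaming"]
pipeline = transformations(lines)  # instant: nothing computed yet

# Only this "action" actually walks the data, in a single pass
print(sum(pipeline))  # 2
```

<p>Spark applies the same idea at cluster scale: because the whole pipeline is known before anything runs, it can fuse steps and keep intermediate data in memory.</p>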



<figure class="wp-block-image"><img decoding="async" src="https://lh5.googleusercontent.com/tHvP_t-ZFlfnCy00xVoj6hu9o_SPWa3b2Wl9O-CdRKXOw8ZF3fDqr9BnDppvAfgPu34wi_eRg4mRnBYbCcvdqK9SxHaV7vAh_stEbp4yuxIUpnujmf24gA3tADPdUNxjdUXSoyUs" alt=""/></figure>



<p>As you can see in the Google Trends chart above, there are alternatives. But Apache Spark has definitely established itself as the leader in distributed computing tools.</p>



<h3 class="wp-block-heading">OVHcloud Data Processing (ODP)</h3>



<p>OVHcloud is the leading European hosting and cloud provider, with a wide range of cloud services such as public and private cloud, managed Kubernetes and cloud storage. Besides all the hosting and cloud services, OVHcloud also provides a range of big data analytics and artificial intelligence services and platforms. </p>



<p>One of the data services offered by OVHcloud is OVHcloud Data Processing (ODP). It is a service that allows you to submit a processing job without worrying about the cluster behind it. You just have to specify the resources you need for the job, and the service will abstract the cluster creation away, destroying the cluster as soon as your job finishes. In other words, you don&#8217;t have to think about clusters any more. Decide how many resources you need to process your data efficiently, then let OVHcloud Data Processing do the rest.</p>



<h4 class="wp-block-heading">On-demand, job-specific Spark clusters</h4>



<p>The service will deploy a temporary, job-specific Apache Spark Cluster, then configure and secure it automatically. You don’t need to have any prior knowledge or skills related to the cloud, networking, cluster management systems, security, etc. You only have to focus on your processing algorithm and Apache Spark code.</p>



<p>This service downloads your Apache Spark code from one of your Object Storage containers and asks you how much RAM and how many CPU cores you would like your job to use. You will also have to specify the region you want the processing to take place in and, last but not least, choose the Apache Spark version you want to use to run your code. The service will then launch your job within a few seconds, according to the specified parameters, and run it until completion. Nothing else to do on your part: no cluster creation, no cluster destruction. Just focus on your code.</p>



<p>Your local computer resources no longer limit the amount of data you can process. You can run any number of processing jobs in parallel, in any region, with any version of Spark. It&#8217;s also very fast and very easy.</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/EF2587E4-9E30-4C3B-A6E1-0EAF3FF8F8F6-1024x813.jpeg" alt="OVHcloud Data Processing" class="wp-image-18856" width="512" height="407" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/EF2587E4-9E30-4C3B-A6E1-0EAF3FF8F8F6-1024x813.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/EF2587E4-9E30-4C3B-A6E1-0EAF3FF8F8F6-300x238.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/EF2587E4-9E30-4C3B-A6E1-0EAF3FF8F8F6-768x610.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/EF2587E4-9E30-4C3B-A6E1-0EAF3FF8F8F6.jpeg 1227w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<h3 class="wp-block-heading">How does it work?</h3>



<p>On your side, you just need to:</p>



<ol class="wp-block-list"><li>Create a container in OVHcloud Object Storage and upload your Apache Spark code and any other required files to this container. Be careful not to put your data in the same container, as the whole container will be downloaded by the service.</li><li>Define the processing engine (such as Apache Spark) and its version, as well as the geographical region and the amount of resources (CPU cores, RAM and number of worker nodes) you need. There are three different ways to do this (OVHcloud Control Panel, API or ODP CLI).</li></ol>



<p>These are the steps that happen when you run a processing job on the OVHcloud Data Processing (ODP) platform:</p>



<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow">
<ol class="wp-block-list"><li>ODP will take over and handle the deployment and execution of your job according to the specifications that you defined.</li><li>Before starting your job, ODP will download all the files that you uploaded to the specified container.</li><li>Next, ODP will run your job in a dedicated environment, created specifically for your job. Apart from a limitation on the available ports (list <a href="https://docs.ovh.com/gb/en/data-processing/capabilities/" data-wpel-link="exclude">available here</a>), your job can then connect to any data source (databases, object storage, etc.) to read or write data, as long as it is reachable through the Internet.</li><li>When the job is complete, ODP stores the execution output logs in your Object Storage and then deletes the whole cluster immediately.</li><li>You will be charged for the amount of resources you specified, and only for the duration of your job&#8217;s computation, on a per-minute basis.</li></ol>
</div></div>



<h3 class="wp-block-heading">Different ways to submit a job</h3>



<p>There are three different ways that you can submit a processing job to ODP, depending on your requirements: the OVHcloud Manager, the OVHcloud API and the CLI (Command Line Interface).</p>



<h4 class="wp-block-heading">1. OVHcloud Manager</h4>



<p>To submit a job with the OVHcloud Manager, you need to go to OVHcloud.com and log in with your OVHcloud account (or create one if necessary). Then go to the “Public Cloud” page, select the “Data Processing” link on the left panel and submit a job by clicking on “Start a new job”.</p>



<p>Before submitting a job, you need to create a container in OVHcloud Object Storage by clicking on the “Object Storage” link on the left panel, then upload your Apache Spark code and any other required files.</p>



<h4 class="wp-block-heading">2. OVHcloud API&nbsp;</h4>



<p>You can submit a job to ODP by using the OVHcloud API. For more information, see the OVHcloud API web page at <a href="https://api.ovh.com/" data-wpel-link="exclude">https://api.ovh.com/</a>. You can also automate job submission by using the ODP API.</p>



<h4 class="wp-block-heading">3. ODP CLI (Command Line Interface)&nbsp;</h4>



<p>ODP has an open-source Command Line Interface that you can find in the OVH public GitHub at <a href="https://github.com/ovh/data-processing-spark-submit" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://github.com/ovh/data-processing-spark-submit</a>. Using the CLI, you can upload your files and code and create your Apache Spark cluster with just one command.</p>



<h3 class="wp-block-heading">Some ODP benefits</h3>



<p>You could always run your processing tasks on your local computer, create an Apache Spark cluster on your own premises, or build and manage a cluster yourself with any cloud provider &#8211; or use similar services from competitors. But ODP has several benefits that are good to keep in mind when deciding on a solution:</p>



<ul class="wp-block-list"><li><strong>No cluster management or configuration</strong> skills or experience are needed.</li><li><strong>Not limited by resources</strong>, and easy and fast (the only limit is your cloud account quota)&nbsp;</li><li><strong>Pay as you go </strong>model with simple pricing and no hidden costs (per-minute billing)</li><li><strong>Per-job resource definition</strong> (no more resources lost compared to a mutualised cluster)&nbsp;</li><li><strong>Ease of managing the Apache Spark version </strong>(you select the version for each job, and you can even run different jobs with different versions of Apache Spark at the same time)&nbsp;</li><li><strong>Region</strong> <strong>selection</strong> (you can select different regions based on your data locality or data privacy policy)</li><li><strong>Start a Data Processing job</strong> <strong>in just a few seconds</strong></li><li><strong>Real-time logs </strong>(while your job is running, you will receive <strong>real-time logs</strong> in your Customer Panel)</li><li><strong>Full output log available just after the job finishes </strong>(some competitors take minutes to deliver logs to you)</li><li><strong>Job submission automation</strong> (by using the ODP API or CLI)</li><li><strong>Data privacy </strong>(OVHcloud is a European company, and all customers are strictly protected by the European GDPR)</li></ul>



<h3 class="wp-block-heading">Conclusion</h3>



<p>With the advance of new technologies and devices, we are flooded with data. More and more, it is essential for businesses and for academic research to process data sets and understand where the value is. By providing the OVHcloud Data Processing (ODP) service, our goal is to provide you with one of the easiest and most efficient platforms to process your data. Just focus on your processing algorithm and ODP will handle the rest for you.</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Ftry-the-new-ovhcloud-data-processing-service%2F&amp;action_name=Do%20you%20need%20to%20process%20your%20data%3F%20Try%20the%20new%20OVHcloud%20Data%20Processing%20service%21&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How to run massive data operations faster than ever, powered by Apache Spark and OVH Analytics Data Compute</title>
		<link>https://blog.ovhcloud.com/how-to-run-massive-data-operations-faster-than-ever-powered-by-apache-spark-and-ovh-analytics-data-compute/</link>
		
		<dc:creator><![CDATA[Mojtaba Imani]]></dc:creator>
		<pubDate>Mon, 27 May 2019 11:48:26 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[apache]]></category>
		<category><![CDATA[Apache Spark]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[Spark]]></category>
		<guid isPermaLink="false">https://blog.ovh.com/fr/blog/?p=15512</guid>

					<description><![CDATA[If you&#8217;re reading this blog for the first time, welcome to the ongoing data revolution! Just after the industrial revolution came what we call the digital revolution, with millions of people and objects accessing a world wide network – the internet – all of them creating new content, new data. Let&#8217;s think about ourselves&#8230; We [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fhow-to-run-massive-data-operations-faster-than-ever-powered-by-apache-spark-and-ovh-analytics-data-compute%2F&amp;action_name=How%20to%20run%20massive%20data%20operations%20faster%20than%20ever%2C%20powered%20by%20Apache%20Spark%20and%20OVH%20Analytics%20Data%20Compute&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p>If you&#8217;re reading this blog for the first time, welcome to the ongoing data revolution! Just after the industrial revolution came what we call the digital revolution, with millions of people and objects accessing a world wide network – the internet – all of them creating new content, new data.</p>



<p>Let&#8217;s think about ourselves&#8230; We now have smartphones taking pictures and sending texts, sports watches collecting data about our health, Twitter and Instagram accounts generating content, and many other use cases. As a result, data in all its forms is exponentially exploding all over the world.</p>



<p><strong>90% of the total data in the world was generated during the last two years.</strong> According to IDC, the amount of data in the world is set to grow from 33 zettabytes in 2018 to 175 zettabytes in 2025. When we do a basic division, this represents approximately 34TB of data per person, including all countries and topologies.</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="960" height="461" src="/blog/wp-content/uploads/2019/05/568E870E-7DBF-4918-A367-FD5F4ED67FA5.png" alt="Annual size of the global datasphere" class="wp-image-15518" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/05/568E870E-7DBF-4918-A367-FD5F4ED67FA5.png 960w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/568E870E-7DBF-4918-A367-FD5F4ED67FA5-300x144.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/568E870E-7DBF-4918-A367-FD5F4ED67FA5-768x369.png 768w" sizes="auto, (max-width: 960px) 100vw, 960px" /></figure></div>



<p>Impressive, isn&#8217;t it?</p>



<p>This opens up a lot of new concepts and usages, but also, of course, new challenges. How do we store this data? How do we keep it secure and private? And last but not least, how do we get value from this data? This new giant datasphere needs to be processed &#8211; in other words, it needs to be used to extract value.</p>



<p>Potential results and applications are infinite: improving the agricultural field by analysing weather forecasts, understanding customers deeply, researching new vaccines, redefining urban environments by analysing traffic jams&#8230; The list goes on.</p>



<p>It seems easy at first, but it requires three main elements:</p>



<ol class="wp-block-list"><li>First, we need <strong>data</strong>. Sometimes these data sources can be heterogeneous (text, audio, video, pictures etc.), and we may need to &#8220;clean&#8221; them before they can be used efficiently.</li><li>Next, we need <strong>compute power</strong>. Think about ourselves again: our brains can perform a lot of calculations and operations, but it&#8217;s impossible to split one task between multiple brains. Ask a friend to do a multiplication with you, and you&#8217;ll see this for yourself. With computers, though, anything is possible! We are now able to parallelise calculations across multiple computers (i.e. a cluster), allowing us to get the results we want faster than ever.</li><li>Last, we need a <strong>framework</strong> &#8211; a set of tools that allows you to use this data lake and compute power efficiently.</li></ol>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="1050" height="499" src="/blog/wp-content/uploads/2019/05/9B2BFE2E-E888-4696-8ADA-83FD9DE4B732.png" alt="Apache Spark &amp; OVH Analytics Data Compute" class="wp-image-15531" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/05/9B2BFE2E-E888-4696-8ADA-83FD9DE4B732.png 1050w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/9B2BFE2E-E888-4696-8ADA-83FD9DE4B732-300x143.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/9B2BFE2E-E888-4696-8ADA-83FD9DE4B732-768x365.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/9B2BFE2E-E888-4696-8ADA-83FD9DE4B732-1024x487.png 1024w" sizes="auto, (max-width: 1050px) 100vw, 1050px" /></figure></div>



<p>How do we build this? Let&#8217;s find out together!</p>



<h3 class="wp-block-heading">Step 1: Find the right framework</h3>



<p>As you&#8217;ll have seen from the title of this post, it&#8217;s not a secret that&nbsp;<strong>Apache Spark</strong> is our preferred tool at OVH.</p>



<p>We chose Apache Spark because it is an open-source, distributed, general-purpose cluster-computing framework with the largest open-source community in the world of big data, and it is up to 100 times faster than the previous cluster-computing framework, Hadoop MapReduce, thanks to features like in-memory processing and lazy evaluation. Apache Spark is the leading platform for large-scale SQL, batch processing, stream processing and machine learning, with an easy-to-use API. For coding in Spark, you have the option of using different programming languages, including&nbsp;<strong>Java, Scala, Python, R and SQL</strong>.</p>



<p>Other tools, like Apache Flink and Beam, look very promising as well, and will be part of our upcoming services.</p>



<p>The different components of Apache Spark are:</p>



<ul class="wp-block-list"><li><strong>Apache Spark Core</strong>, which provides in-memory computing, and forms the basis of the other components</li><li><strong>Spark SQL</strong>, which provides structured and semi-structured data abstraction</li><li><strong>Spark Streaming</strong>, which performs streaming analysis using RDD (Resilient Distributed Datasets) transformation</li><li><strong>MLlib</strong> <strong>(Machine Learning Library)</strong>, which is a distributed machine learning framework on top of Spark</li><li><strong>GraphX</strong>, which is a distributed graph processing framework on top of Spark</li></ul>



<h4 class="wp-block-heading">The Apache Spark architecture principle</h4>



<p>Before going further, let&#8217;s take the time to understand how Apache Spark can be so fast by reviewing its workflow.</p>



<p>Here is a sample code in Python, where we will read a file and count the number of lines with the letter &#8216;a&#8217;, and the number of lines with the letter &#8216;b&#8217;.</p>



<pre class="wp-block-code language-python"><code>from pyspark import SparkContext
 
logFile = "YOUR_SPARK_HOME/README.md"  # Should be some file on your system
sc = SparkContext("local", "Simple App")
logData = sc.textFile(logFile).cache()
 
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()
 
print("Lines with a: %i, lines with b: %i" % (numAs, numBs))
 
sc.stop()</code></pre>



<p>This code is part of your Spark Application, also known as your Driver Program.</p>



<p>Each action (<code>count()</code>, in our example) will trigger jobs. Apache Spark will then split your work into multiple tasks that can be computed separately.</p>



<p>Apache Spark stores data in RDD (Resilient Distributed Datasets), which is an immutable distributed collection of objects, and then divides it into different logical partitions, so it can process each part in parallel, in different nodes of the cluster.</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/05/B31D0624-AA68-4A9D-B74B-73120149B78D-1024x530.png" alt="" class="wp-image-15530" width="768" height="398" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/05/B31D0624-AA68-4A9D-B74B-73120149B78D-1024x530.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/B31D0624-AA68-4A9D-B74B-73120149B78D-300x155.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/B31D0624-AA68-4A9D-B74B-73120149B78D-768x398.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/B31D0624-AA68-4A9D-B74B-73120149B78D-1200x621.png 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/B31D0624-AA68-4A9D-B74B-73120149B78D.png 2048w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure></div>



<p><strong>Task&nbsp;</strong><b>parallelism</b>&nbsp;and <strong>in-memory computing</strong> are the key to being ultra-fast here. You can go deeper in the <a href="https://spark.apache.org/docs/latest/cluster-overview.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">official documentation</a>.</p>



<h3 class="wp-block-heading">Step 2: Find the required compute power</h3>



<p>We now have the tools, but they need compute power (mainly CPU and RAM) to perform such massive operations, and this has to be scalable.</p>



<p>Let&#8217;s talk about creating a cluster of computers. The old-fashioned way is to buy physical computers and the network equipment to connect them together, install the OS and all required software and packages, install Apache Spark on all the nodes, then configure Spark&#8217;s standalone cluster management system and connect all workers to the master node.</p>



<p>Obviously, this isn&#8217;t the best way. It takes a lot of time and requires skilled engineers to do all the work. Also, assume that you did this difficult job and then finished your big data processing&#8230; What are you going to do with the cluster after that? Just leave it there, or sell it on the second-hand market? What if you decided to perform some larger-scale processing and needed to add more computers to your cluster? You&#8217;d need to do all the software and network installation and configuration for the new nodes.</p>



<p>A better way of creating a cluster is <strong>to use a Public Cloud provider</strong>. This way, you will have your servers deployed very quickly, only pay what you consume, and can delete the cluster after finishing your processing task. You&#8217;ll also be able to access your data much more easily than you would with an on-premises solution. It&#8217;s not a coincidence that, according to IDC,&nbsp;<strong>half of the total data in the world will be stored in the public cloud by 2025</strong> [3].</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="960" height="554" src="/blog/wp-content/uploads/2019/05/AF0EE15E-82CF-490D-B872-0AEE36CDC3F1.png" alt="Where is the data stored?" class="wp-image-15519" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/05/AF0EE15E-82CF-490D-B872-0AEE36CDC3F1.png 960w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/AF0EE15E-82CF-490D-B872-0AEE36CDC3F1-300x173.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/AF0EE15E-82CF-490D-B872-0AEE36CDC3F1-768x443.png 768w" sizes="auto, (max-width: 960px) 100vw, 960px" /></figure></div>



<p>But the main problem persists: you still need to install all the software and packages on each of the servers in your virtual cluster, then configure the network and routers, take security measures and configure the firewall, and finally, install and configure the Apache Spark cluster management system. It will take time and be prone to errors, and the longer it takes, the more you will be charged due to having those servers deployed in your cloud account.</p>



<h3 class="wp-block-heading">Step 3: Take a rest, and discover OVH Analytics Data Compute</h3>



<p>As we&#8217;ve just seen, building a cluster can be done manually, but it&#8217;s a boring and time-consuming task.</p>



<p>At OVH, we solved this problem by introducing a cluster-computing service called <strong>Analytics Data Compute</strong>, which will create a 100% ready, fully installed and configured Apache Spark cluster on the fly. By using this service, you don&#8217;t need to waste your time on server creation, network, firewalls and security configurations on each node of your cluster. You just focus on your tasks, and the compute cluster you need will appear as if by magic!</p>



<p>In fact, there&#8217;s nothing really magic about it&#8230; just automations made by OVH to simplify both our lives and yours. We needed this kind of tool internally for large computations, and then crafted it into a product for you.</p>



<p>The concept is quite simple: you launch an Apache Spark job as normal through the command line or API, and a full Apache Spark cluster will be built on the fly, just for your job. Once the processing is done, we delete the cluster and you&#8217;re invoiced for the exact resources that were used (on an hourly basis, for now).</p>



<p>This way, we are able to rapidly scale from one to thousands of virtual machines, allowing you to use thousands of CPU cores and thousands of gigabytes of RAM.</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/05/CD83F4F3-4858-414A-A50E-362477780385-1024x662.png" alt="" class="wp-image-15537" width="768" height="497" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/05/CD83F4F3-4858-414A-A50E-362477780385-1024x662.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/CD83F4F3-4858-414A-A50E-362477780385-300x194.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/CD83F4F3-4858-414A-A50E-362477780385-768x496.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/CD83F4F3-4858-414A-A50E-362477780385-1200x775.png 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/CD83F4F3-4858-414A-A50E-362477780385.png 2048w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure></div>



<p>To use Analytics Data Compute, you need to download a small, open-source client software package from the OVH repository, called <a href="https://repository.dataconvergence.ovh.com/repository/binary/ovh-spark-submit/mac/ovh-spark-submit" data-wpel-link="exclude">ovh-spark-submit</a>.</p>



<p>This client was designed to keep the official spark-submit command-line syntax of Apache Spark. Most of the options and syntax are the same, although the OVH version adds a few options related to infrastructure and cluster management. This way, you simply request to run your code over your data on a cluster with a specific number of nodes, and the tool will create a cluster with that many nodes, install all packages and software (including Spark and its cluster management system), and then configure the network and firewall. After creating the cluster, OVH Analytics Data Compute will run your Spark code over it, return the result to you, and then delete the whole thing once it&#8217;s done. Much more efficient!</p>



<h3 class="wp-block-heading">Let&#8217;s get it started&#8230; Feel the power!</h3>



<p>The good news is that if you are already familiar with the spark-submit command line of Apache Spark, you don&#8217;t need to learn any new command-line tools, as ovh-spark-submit uses almost exactly the same options and commands.</p>



<p>Let&#8217;s look at an example, where we&#8217;ll estimate the value of the famous number Pi, first with the original Apache Spark syntax, and then with the ovh-spark-submit client:</p>



<pre class="wp-block-code language-bash"><code>./spark-submit \
	--class org.apache.spark.examples.SparkPi \
	--total-executor-cores 20 \
	SparkPI.jar 100

./ovh-spark-submit \
	--class org.apache.spark.examples.SparkPi \
	--total-executor-cores 20 \
	SparkPI.jar 100</code></pre>



<p>You can see that the only difference is &#8220;ovh-&#8221; at the beginning of the command line, while the rest is the same. And by running the <code>ovh-spark-submit</code> command, you will run the job over a cluster of computers with 20 cores instead of just your local computer. This cluster is fully dedicated to this job, as it will be created after running the command, then deleted once it&#8217;s finished.</p>



<p>Another example is the popular word-count use case. Let&#8217;s assume you want to count the occurrences of each word in a big text file, using a cluster of 100 cores. The big text file is stored in OpenStack Swift storage (although it could be any online or cloud storage system). The Spark code for this calculation in Java looks like this:</p>



<pre class="wp-block-code language-java"><code>// Assumes an existing SparkSession named "spark" and a precompiled
// regex for splitting on spaces: Pattern SPACE = Pattern.compile(" ");
JavaRDD&lt;String> lines = spark.read().textFile("swift://textfile.abc/novel.txt").javaRDD();

// Split each line into words, map each word to (word, 1), then sum the counts per word
JavaRDD&lt;String> words = lines.flatMap(s -> Arrays.asList(SPACE.split(s)).iterator());
JavaPairRDD&lt;String, Integer> ones = words.mapToPair(s -> new Tuple2&lt;>(s, 1));
JavaPairRDD&lt;String, Integer> counts = ones.reduceByKey((i1, i2) -> i1 + i2);
List&lt;Tuple2&lt;String, Integer>> output = counts.collect();</code></pre>



<p>We can select the desired version of Spark as well. For this example, we&#8217;ve selected Spark version 2.4.0, and the command line for running this Spark job looks like this:</p>



<pre class="wp-block-code language-bash"><code>./ovh-spark-submit \
	--class JavaWordCount \
	--total-executor-cores 100 \
	--name wordcount1 \
	--version 2.4.0 \
	SparkWordCount-fat.jar </code></pre>



<p>To create our Spark cluster, we use nodes that have four vCores and 15GB of RAM. Therefore, by running this command, a cluster of 26 servers will be created (one for the master node and 25 for workers), so we will have 25&#215;4=100 vCores and 25&#215;15=375GB of RAM.</p>
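<p>As a quick sanity check, the sizing arithmetic above can be reproduced with shell arithmetic (the node shape of four vCores and 15GB of RAM is the one described above):</p>



<pre class="wp-block-code language-bash"><code># Sizing for a 100-core request on nodes with 4 vCores and 15GB of RAM
total_cores=100
cores_per_node=4
ram_per_node_gb=15

workers=$(( total_cores / cores_per_node ))  # 100 / 4 = 25 worker nodes
servers=$(( workers + 1 ))                   # plus one master node = 26
total_ram=$(( workers * ram_per_node_gb ))   # 25 x 15 = 375GB of RAM

echo "${servers} servers, ${total_cores} vCores, ${total_ram}GB RAM"
# prints: 26 servers, 100 vCores, 375GB RAM</code></pre>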



<p>After running the command line, you will see the progress of creating the cluster and installing all the required software.</p>



<p>Once the cluster is created, you can take a look at it with the official Spark dashboard, and check if your cluster has all 25 workers up and running:</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="813" height="884" src="/blog/wp-content/uploads/2019/05/3BBC4753-797C-4E99-8DD6-F6E24E4A9B0E.png" alt="" class="wp-image-15522" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/05/3BBC4753-797C-4E99-8DD6-F6E24E4A9B0E.png 813w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/3BBC4753-797C-4E99-8DD6-F6E24E4A9B0E-276x300.png 276w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/3BBC4753-797C-4E99-8DD6-F6E24E4A9B0E-768x835.png 768w" sizes="auto, (max-width: 813px) 100vw, 813px" /></figure></div>



<p>Also, if you go to the OpenStack Horizon dashboard in your OVH cloud account, you will see all 26 servers:</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="1024" height="644" src="https://www.ovh.com/blog/wp-content/uploads/2019/05/CD92F627-35E2-42FC-B1F9-740FCC19BA1D-1024x644.png" alt="" class="wp-image-15523" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/05/CD92F627-35E2-42FC-B1F9-740FCC19BA1D-1024x644.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/CD92F627-35E2-42FC-B1F9-740FCC19BA1D-300x189.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/CD92F627-35E2-42FC-B1F9-740FCC19BA1D-768x483.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/CD92F627-35E2-42FC-B1F9-740FCC19BA1D-1200x755.png 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/CD92F627-35E2-42FC-B1F9-740FCC19BA1D.png 1311w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<p>The Apache Spark job will be executed according to the Java code in the jar file that we sent to the Spark cluster, and the results will be shown on the screen. The results and the complete log files will also be saved on both the local computer and in the user&#8217;s Swift storage.</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="212" height="372" src="/blog/wp-content/uploads/2019/05/336DF00D-6802-419F-90AB-E0C2E81B6A8A.png" alt="" class="wp-image-15524" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/05/336DF00D-6802-419F-90AB-E0C2E81B6A8A.png 212w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/336DF00D-6802-419F-90AB-E0C2E81B6A8A-171x300.png 171w" sizes="auto, (max-width: 212px) 100vw, 212px" /></figure></div>



<p>Once you&#8217;re done, you will see the message that the cluster has been deleted, along with the addresses of the logs in OpenStack Swift storage and on the local computer. You can see in the following screenshot that creating a fully installed and configured Spark cluster with 26 servers took less than five minutes.</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="1024" height="473" src="https://www.ovh.com/blog/wp-content/uploads/2019/05/C4313AB6-1E72-468F-A48E-60B71D75873F-1024x473.png" alt="" class="wp-image-15525" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/05/C4313AB6-1E72-468F-A48E-60B71D75873F-1024x473.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/C4313AB6-1E72-468F-A48E-60B71D75873F-300x138.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/C4313AB6-1E72-468F-A48E-60B71D75873F-768x354.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/C4313AB6-1E72-468F-A48E-60B71D75873F-1200x554.png 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<h3 class="wp-block-heading">A bit more about OVH Analytics Data Compute</h3>



<p>If you are curious, here are some additional details about Analytics Data Compute:</p>



<ul class="wp-block-list"><li>Everything is built on the OVH Public Cloud, which means everything is powered by OpenStack.</li><li>You can choose the Apache Spark version you want to run, directly in the command line. You can also, of course, run multiple clusters with different versions.</li><li>A new dedicated cluster will be created for each request, and will be deleted after finishing the job. This means there are no security or privacy issues caused by having multiple users for a single cluster.</li><li>You have the option of keeping your cluster after finishing the job. If you add the <em>keep-infra</em> option to the command line, the cluster will not be deleted when you&#8217;re done. You can then send more jobs to that cluster or view more details from the logs.</li><li>Your cluster computers are created in your own OVH Public Cloud project, so you have full control of your cluster computers.</li><li>Results and output logs will be saved in Swift on your OVH Public Cloud project. Only you will have access to them, and you will also have the full history of all your Spark jobs saved in a folder, organised by date and time of execution.</li><li>Input and output of data can be any source or format. There is no vendor lock-in when it comes to storage, so you are not forced to only use OVH cloud storage to store your data, and can use any online or cloud storage platform on the public internet.</li><li>You can access your Cluster and Spark dashboards and web UIs via HTTPS.</li></ul>
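<p>As an illustration of the <em>keep-infra</em> option mentioned above (the exact flag spelling here is an assumption, so check the documentation for your client version), keeping the cluster alive after a job could look like this:</p>



<pre class="wp-block-code language-bash"><code># Hypothetical invocation: the --keep-infra flag spelling is assumed
./ovh-spark-submit \
	--class JavaWordCount \
	--total-executor-cores 100 \
	--keep-infra \
	SparkWordCount-fat.jar</code></pre>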



<h4 class="wp-block-heading">Let&#8217;s focus on cluster management systems</h4>



<p>In Apache Spark clusters, independent processes called &#8220;executors&#8221; run on all the cluster nodes and are coordinated by the driver program. To allocate cluster resources across applications, the driver program connects to a cluster management system, after which it sends application code and tasks to the executors.</p>



<p>There are several options when it comes to cluster management systems, but to keep things fast and simple, we selected the Spark standalone cluster management system. This offers our users the freedom to choose any version of Spark, and also makes cluster installation faster than the other options. If, for example, we had selected Kubernetes as our cluster management system, our users would have been limited to Spark versions 2.3 or above, and cluster installation would have been more time-consuming. Alternatively, if we wanted to deploy a ready-to-use Kubernetes cluster (like OVH Managed Kubernetes), then we would have lost our scalability, because the infrastructure of our Apache Spark cluster would have been inherently limited by the infrastructure of the Kubernetes cluster. But with our current design, users can have an Apache Spark cluster with as many servers as they like, and the freedom to scale easily.</p>
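<p>For context, this is roughly what setting up a standalone cluster by hand looks like with the scripts that ship with Spark 2.4; Analytics Data Compute automates these steps (plus networking and firewall rules) on every node. The master hostname below is a placeholder:</p>



<pre class="wp-block-code language-bash"><code># On the master node: start the standalone cluster manager
./sbin/start-master.sh

# On each worker node: register with the master (placeholder hostname)
./sbin/start-slave.sh spark://master-host:7077

# Then point a job at the standalone master (default port 7077)
./bin/spark-submit \
	--master spark://master-host:7077 \
	--class JavaWordCount \
	SparkWordCount-fat.jar</code></pre>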



<h3 class="wp-block-heading">Try it yourself!</h3>



<p>To get started with Analytics Data Compute, you just need to create a cloud account at www.ovh.com, then download the ovh-spark-submit software, and run it as described in <a href="https://docs.ovh.com/gb/en/analytics-data-compute/labs/data-compute/getting-started-with-analytics-data-compute/" data-wpel-link="exclude">the OVH documentation page</a>. Also, if you participate in <a href="https://labs.ovh.com/analytics-data-compute" data-wpel-link="exclude">a short survey on our OVH Labs page</a>, you will receive a voucher, which will let you test Analytics Data Compute first-hand, with 20 euros of free credit.</p>



<p>If you have any questions or would like further explanation, our team is available through our Gitter channel.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
