Why are you still managing your data processing clusters?

Cluster computing is used to share a computation load among a group of computers. This achieves a higher level of performance and scalability.

Apache Spark is an open-source, distributed cluster-computing framework that is much faster than its predecessor, Hadoop MapReduce, thanks to features like in-memory processing and lazy evaluation. It is the most popular tool in this category.
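To make lazy evaluation more concrete, here is a minimal sketch in plain Python. It is only an analogy built on generators, not Spark's API, and the function names are invented for illustration. The idea is the same, though: describing a pipeline costs nothing, and work only happens when a result is actually requested.

```python
# Plain-Python analogy for lazy evaluation; this is NOT Spark's API.
# All function names here are hypothetical and exist only for illustration.
log = []  # records each time an element is actually read


def read_numbers(n):
    """Pretend data source: yields 0..n-1 and logs every read."""
    for i in range(n):
        log.append(f"read {i}")
        yield i


def squares(numbers):
    """Lazy "transformation": square each element."""
    for x in numbers:
        yield x * x


def evens(numbers):
    """Lazy "transformation": keep only even values."""
    for x in numbers:
        if x % 2 == 0:
            yield x


# Building the pipeline does no work yet; nothing has been read.
pipeline = evens(squares(read_numbers(5)))
work_before_action = len(log)

# The "action" (sum) pulls the data through the whole pipeline in one pass.
total = sum(pipeline)
work_after_action = len(log)

print(work_before_action, work_after_action, total)
```

Because nothing runs until the final step, an engine that works this way can look at the whole chain of operations before executing it, which is one of the reasons lazy evaluation helps performance.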

This analytics engine is the leading framework for large-scale SQL, batch processing, stream processing and machine learning. You can write Spark code in several programming languages, including Java, Scala, Python, R and SQL. Spark can run locally on a single machine, or on a cluster of computers that share the tasks.

With Apache Spark, you can process your data on your local computer, or you can create a cluster and send it any number of processing jobs.

You can build your cluster from physical computers on-premises, from virtual machines at a hosting company, or with any cloud provider. With your own cluster, you’ll be able to send Spark jobs whenever you like.

Cluster Management Challenges

If you are processing a huge amount of data and you expect results in a reasonable time, your local computer won’t be enough. You need a cluster of computers to divide the data and the processing workload: several computers run in parallel to speed up the task.
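The divide-and-process idea can be sketched on a single machine. The example below is only an illustration under simplifying assumptions: it uses a thread pool from Python's standard library instead of real cluster nodes, and the `split` and `process` helpers are hypothetical stand-ins for partitioning and per-node work.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, parts):
    """Divide data into `parts` contiguous chunks of nearly equal size."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def process(chunk):
    """Stand-in for the work one node would do on its share of the data."""
    return sum(x * x for x in chunk)


data = list(range(1_000))
chunks = split(data, parts=4)

# Each worker handles one chunk; in a real cluster these would be machines.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_results = list(pool.map(process, chunks))

# Combine the partial results into the final answer.
total = sum(partial_results)
print(len(chunks), total)
```

On a real cluster, each chunk would live on a different machine and the final combination step would happen over the network, which is exactly the plumbing a framework like Spark takes care of.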

Creating and managing your own cluster of computers, however, is not an easy task. You will face several challenges:

Cluster Creation

Creating an Apache Spark cluster is an arduous task.

First, you’ll need to create a cluster of computers and install an operating system, development tools (Python, Java, Scala), etc.

Second, you’ll need to select a version of Apache Spark and install the necessary nodes (master and workers).

Lastly, you’ll need to connect all these nodes together to finalize your Apache Spark cluster.

All in all, it can take several hours to create and configure a new Apache Spark cluster.

Cluster Management

But once you have your own cluster up and running, your job is far from over. Is your cluster working well? Is each and every node healthy?

Here is the second challenge: facing the pain of cluster management!

You’ll need to check the health of all your nodes manually or, preferably, install monitoring tools that report any issues nodes may encounter.

Do the nodes have enough disk space available for new tasks? One key issue faced by Apache Spark clusters is that some tasks write a lot of data to the nodes’ local disks without ever deleting it. Running out of disk space is a common problem and, as you may know, a node with a full disk cannot run any more tasks.
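As an illustration of the kind of check you would end up scripting yourself, here is a minimal sketch using Python's standard `shutil.disk_usage`. The 10% threshold and the path being checked are assumptions chosen for the example; a real setup would run something like this on every node and feed the result into a monitoring tool.

```python
import shutil


def free_fraction(path):
    """Return the fraction of free space on the disk holding `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        return 0.0  # avoid division by zero on unusual filesystems
    return usage.free / usage.total


def needs_cleanup(path, min_free=0.10):
    """True when free space drops below the threshold (assumed: 10%)."""
    return free_fraction(path) < min_free


scratch_dir = "."  # assumption: replace with the node's local scratch directory
print(f"free: {free_fraction(scratch_dir):.1%}, "
      f"cleanup needed: {needs_cleanup(scratch_dir)}")
```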

Do you need to run multiple Spark jobs at the same time? Sometimes a single job occupies all the CPU and RAM resources in your cluster and doesn’t allow other jobs to start and run at the same time.

These are only a few of the problems you will meet while working with your own clusters.

Cluster Security

Now for the third challenge! What is even more important than having a cluster up and running smoothly?

You guessed it: security. After all, Apache Spark is a data processing tool, and data is very sensitive.

Where in your cluster does security matter most?

What about the connection between nodes? Are they connected with a secured (and fast) connection? Who has access to the servers housing your cluster?

If you have created your cluster on the cloud and you are working with sensitive data, you’ll need to address these issues by securing each and every node and encrypting communications between them.

Spark Version

Here is your fourth challenge: managing the expectations of your cluster’s users. In some cases this is a less daunting task, but not in all.

There isn’t a whole lot you can do to change the expectations of your cluster’s users, but here’s a common example to help you prepare:

Do your users like to test their code with different versions of Apache Spark? Or do they require the latest feature from the latest Spark nightly version?

When you create an Apache Spark cluster, you have to select one version of Spark. Your whole cluster will be bound to it, and it alone. This means it isn’t possible for several versions to cohabit in the same cluster.

So, you’ll either have to change the Spark version of your whole cluster or create another, separate cluster. And of course, if you decide to change the version, you’ll have to schedule downtime on your cluster to make the modifications.

Cluster Efficiency

And for the final challenge: scaling!

How can you get the most benefit from the cluster resources you are paying for? Are you paying for your cluster but feel you’re not using it efficiently? Is your cluster too big for your users? Is it running, but empty of jobs during the holiday seasons?

When you have a processing cluster, especially one with a lot of precious resources that you’re paying for, you will always have one major concern: is your cluster being utilised as efficiently as possible? There will be occasions when some resources in your cluster are idle, or when you are only running small jobs that don’t need everything the cluster offers. Scaling will become a major pain point.

OVHcloud Data Processing (ODP) Solution

At OVHcloud, we created a new data service called OVHcloud Data Processing (ODP) to address all cluster management challenges mentioned above.

Let’s assume that you have some data to process but you don’t have the desire, the time, the budget or the skills to overcome these challenges. Maybe you don’t want to, or can’t, ask for help from colleagues or consultants to spawn and manage a cluster. How can you still make use of Apache Spark? This is where the ODP service comes in!

With ODP, you only need to write your Apache Spark code and ODP will do the rest. It will create a disposable, dedicated Apache Spark cluster in the cloud for each job in just a few seconds, then delete the whole cluster once the job has finished. You only pay for the requested resources and only for the duration of the computation. There is no need to pay for hours and hours of cloud servers while you are busy installing and configuring the cluster, or even debugging and updating the engine version.

ODP Cluster Creation

When you submit your job, ODP will create an Apache Spark cluster dedicated to that job in just a few seconds. This cluster will have the amount of CPU and RAM and the number of workers specified in the job submission form. All necessary software will be installed automatically. You don’t need to worry at all about the cluster, or about how to install, configure or secure it. ODP does all of this for you.

ODP Cluster Management

When you submit your job, cluster management and monitoring are configured and handled by ODP. All logging and monitoring mechanisms and tools will be installed automatically for you. You will have a Grafana dashboard to monitor different parameters and resources of your job and you will have access to the official Apache Spark dashboard.

You don’t need to worry about cleaning the local disk of each node because each job will start with fresh resources. It isn’t possible, therefore, for one job to delay another job as each job has new, dedicated resources.

ODP Cluster Security

ODP will take care of the security and privacy of your cluster as well. Firstly, all communications between the Spark nodes are encrypted. Secondly, none of your job’s nodes are accessible from the outside. ODP only allows a limited set of ports to be open for your cluster, so that you are still able to load or push your data.

ODP Cluster Spark Version

When it comes to using multiple Spark versions on the same cluster, ODP offers a solution. As every job possesses its own dedicated resources, each job can use any version currently supported by the service, independently from any other job running at the same time. When submitting an Apache Spark job through ODP, you will first select the version of Apache Spark you would like to use. When the Apache Spark community releases a new version, it will soon become available in ODP and you can then submit another job with the new Spark version as well. This means you don’t need to keep updating the Spark version of your whole cluster anymore.

ODP Cluster Efficiency

Each time you submit a job, you’ll define exactly how many resources and workers you would like to use for that job. As said earlier, each job has its own dedicated resources, so you will be able to have small jobs running alongside much bigger ones. This flexibility means that you will never have to worry about having an idle cluster. You pay for the resources you use, when you use them.
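A rough back-of-the-envelope comparison shows why this matters. Every number in the sketch below is hypothetical (the price, the job mix and the flat per-node-hour billing model are simplifying assumptions, not OVHcloud's actual pricing); it only illustrates how an always-on cluster compares with paying per job.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month


def always_on_cost(nodes, price_per_node_hour, hours=HOURS_PER_MONTH):
    """Cost of keeping a fixed-size cluster running all month."""
    return nodes * price_per_node_hour * hours


def per_job_cost(jobs, price_per_node_hour):
    """Cost when each job is billed only for its own nodes and duration.

    `jobs` is a list of (nodes, hours) tuples.
    """
    return sum(nodes * hours * price_per_node_hour for nodes, hours in jobs)


def utilisation(jobs, nodes, hours=HOURS_PER_MONTH):
    """Share of the always-on cluster's node-hours the jobs actually use."""
    used = sum(n * h for n, h in jobs)
    return used / (nodes * hours)


# Hypothetical inputs: a flat price and a month's worth of jobs.
price = 0.10  # currency units per node-hour (made-up value)
jobs = [(10, 2.0)] * 20 + [(2, 0.5)] * 100  # 20 big jobs, 100 small ones

fixed = always_on_cost(nodes=10, price_per_node_hour=price)
on_demand = per_job_cost(jobs, price_per_node_hour=price)

print(f"always-on: {fixed:.2f}, per-job: {on_demand:.2f}, "
      f"utilisation: {utilisation(jobs, nodes=10):.1%}")
```

With this made-up workload, the always-on cluster sits idle most of the month, which is the situation that per-job billing is meant to avoid.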

How to start?

If you’re interested in trying ODP, you can check out https://www.ovhcloud.com/en/public-cloud/data-processing/ or create an account at www.ovhcloud.com and select “data processing” in the public cloud section. You can also ask the product team questions directly in the ODP public Gitter channel: https://gitter.im/ovh/data-processing.

Conclusion

With ODP, the challenges of running an Apache Spark cluster are removed or alleviated (we still can’t do much about users’ expectations!). You don’t have to worry about lacking the resources needed to process your data, or about having to create, install and manage your own cluster.

Focus on your processing algorithm and ODP will do the rest.

Mojtaba Imani

DevOps at OVHcloud

DevOps @OVHcloud, Cloud and Data Engineer, Developer Evangelist