How to run massive data operations faster than ever, powered by Apache Spark and OVH Analytics Data Compute

If you’re reading this blog for the first time, welcome to the ongoing data revolution! Just after the industrial revolution came what we call the digital revolution, with millions of people and objects accessing a worldwide network – the internet – and all of them creating new content and new data.

Let’s think about ourselves… We now have smartphones taking pictures and sending texts, sports watches collecting data about our health, Twitter and Instagram accounts generating content, and many other use cases. As a result, data in all its forms is growing exponentially all over the world.

90% of the total data in the world was generated during the last two years. According to IDC, the amount of data in the world is set to grow from 33 zettabytes in 2018 to 175 zettabytes in 2025. A basic division gives approximately 34TB of data per person, averaged over every country and region.

Annual size of the global datasphere

Impressive, isn’t it?

This opens up a lot of new concepts and usages, but also, of course, new challenges. How do we store this data? How do we keep it secure and private? And last but not least, how do we get value from it? This giant new datasphere needs to be processed – in other words, used to extract value.

Potential results and applications are infinite: improving agriculture by analysing weather forecasts, understanding customers more deeply, researching new vaccines, redefining urban environments by analysing traffic jams… The list goes on.

It seems easy at first, but it requires three main elements:

  1. First, we need data. Sometimes these data sources can be heterogeneous (text, audio, video, pictures etc.), and we may need to “clean” them before they can be used efficiently.
  2. Next, we need compute power. Think about ourselves again: our brains can perform a lot of calculations and operations, but it’s impossible to split one task between multiple brains. Ask a friend to do a multiplication with you, and you’ll see this for yourself. With computers, though, anything is possible! We are now able to parallelise calculations across multiple computers (i.e. a cluster), allowing us to get the results we want faster than ever.
  3. Last, we need a framework – a set of tools that allow you to use this data lake and compute power efficiently.

Apache Spark & OVH Analytics Data Compute

How do we build this? Let’s find out together!

Step 1: Find the right framework

As you’ll have seen from the title of this post, it’s not a secret that Apache Spark is our preferred tool at OVH.

We chose Apache Spark because it is an open-source, distributed, general-purpose cluster-computing framework with the largest open-source community in the world of big data, and because it can be up to 100 times faster than the previous cluster-computing framework, Hadoop MapReduce, thanks to features like in-memory processing and lazy evaluation. Apache Spark is the leading platform for large-scale SQL, batch processing, stream processing and machine learning, and it has an easy-to-use API. To code in Spark, you can choose between several programming languages, including Java, Scala, Python, R and SQL.

Other tools, like Apache Flink and Beam, look very promising as well, and will be part of our upcoming services.

The different components of Apache Spark are:

  • Apache Spark Core, which provides in-memory computing, and forms the basis of other components
  • Spark SQL, which provides structured and semi-structured data abstraction (see the short sketch after this list)
  • Spark Streaming, which performs streaming analysis using RDD (Resilient Distributed Datasets) transformations
  • MLlib (Machine Learning Library), which is a distributed machine learning framework on top of Spark
  • GraphX, which is a distributed graph processing framework on top of Spark
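
To make the Spark SQL component a little more concrete, here is a minimal PySpark sketch – the file name "people.json" and its columns are illustrative assumptions, not part of the original post – that loads structured data into a DataFrame and queries it with SQL:

from pyspark.sql import SparkSession

# Minimal Spark SQL sketch; "people.json" and its columns are assumed example data
spark = SparkSession.builder.appName("spark-sql-example").getOrCreate()

df = spark.read.json("people.json")        # structured data loaded as a DataFrame
df.createOrReplaceTempView("people")       # expose the DataFrame to the SQL engine

spark.sql("SELECT name, age FROM people WHERE age >= 18").show()

spark.stop()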

The Apache Spark architecture principle

Before going further, let’s take the time to understand how Apache Spark can be so fast by reviewing its workflow.

Here is some sample code in Python, where we read a file and count the number of lines containing the letter ‘a’ and the number of lines containing the letter ‘b’.

from pyspark import SparkContext
 
logFile = "YOUR_SPARK_HOME/README.md"  # Should be some file on your system
sc = SparkContext("local", "Simple App")
logData = sc.textFile(logFile).cache()
 
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()
 
print("Lines with a: %i, lines with b: %i" % (numAs, numBs))
 
sc.stop()

This code is part of your Spark Application, also known as your Driver Program.

Each action (count(), in our example) triggers a job. Apache Spark then splits your work into multiple tasks that can be computed separately.

Apache Spark stores data in RDDs (Resilient Distributed Datasets) – immutable, distributed collections of objects – and divides each RDD into logical partitions, so that each part can be processed in parallel on different nodes of the cluster.

Task parallelism and in-memory computing are the keys to being ultra-fast here. You can dig deeper in the official documentation.
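
To see these two ideas in a snippet, here is a small PySpark sketch building on the example above (the partition count and the local master are arbitrary choices for illustration): transformations such as filter() are lazy, while the count() action triggers a job whose tasks run on the RDD’s partitions in parallel.

from pyspark import SparkContext

sc = SparkContext("local[4]", "Lazy Eval Demo")   # 4 local worker threads, an arbitrary choice

# textFile() and filter() are lazy transformations: nothing is read or computed yet
logData = sc.textFile("YOUR_SPARK_HOME/README.md", minPartitions=4)
linesWithA = logData.filter(lambda s: 'a' in s)

print(logData.getNumPartitions())   # how many logical partitions the RDD was split into

# count() is an action: it triggers a job whose tasks process the partitions in parallel
print(linesWithA.count())

sc.stop()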

Step 2: Find the required compute power

We now have the tools, but they need compute power (mainly CPU and RAM) to perform such massive operations, and this compute power has to be scalable.

Let’s talk about creating a cluster of computers. The old-fashioned way is to buy physical computers and the network equipment to connect them together, install the OS and all the required software and packages, install Apache Spark on all the nodes, then configure Spark’s standalone cluster management system and connect all the workers to the master node.

Obviously, this isn’t the best way. It takes a lot of time and needs skilled engineers to do all of it. Also, assume that you did this difficult job and then finished your big data processing… What are you going to do with the cluster after that? Just leave it there, or sell it on the second-hand market? And what if you decided to perform some larger-scale processing and needed to add more computers to your cluster? You’d need to do all the software and network installation and configuration again for the new nodes.

A better way of creating a cluster is to use a Public Cloud provider. This way, you will have your servers deployed very quickly, only pay for what you consume, and can delete the cluster after finishing your processing task. You’ll also be able to access your data much more easily than you would with an on-premises solution. It’s not a coincidence that, according to IDC, half of the total data in the world will be stored in the public cloud by 2025 [3].

Where is the data stored?

But the main problem persists: you still need to install all the software and packages on each of the servers in your virtual cluster, then configure the network and routers, take security measures and configure the firewall, and finally, install and configure the Apache Spark cluster management system. It will take time and be prone to errors, and the longer it takes, the more you will be charged due to having those servers deployed in your cloud account.

Step 3: Take a rest, and discover OVH Analytics Data Compute

As we’ve just seen, building a cluster can be done manually, but it’s a boring and time-consuming task.

At OVH, we solved this problem by introducing a cluster-computing service called Analytics Data Compute, which creates a 100% ready, fully installed and configured Apache Spark cluster on the fly. By using this service, you don’t need to waste your time on server creation, or on network, firewall and security configuration on each node of your cluster. You just focus on your tasks, and the compute cluster you need will appear as if by magic!

In fact, there’s nothing really magic about it… just automation built by OVH to simplify both our lives and yours. We needed this kind of tool internally for large computations, and then crafted it into a product for you.

The concept is quite simple: you launch an Apache Spark job as normal through the command line or API, and a full Apache Spark cluster will be built on the fly, just for your job. Once the processing is done, we delete the cluster and you’re invoiced for the exact resources that were used (on an hourly basis, for now).

This way, we are able to rapidly scale from one to thousands of virtual machines, allowing you to use thousands of CPU cores and thousands of gigabytes of RAM.

To use Analytics Data Compute, you need to download a small, open-source client software package from the OVH repository, called ovh-spark-submit.

This client was designed to keep the official spark-submit command line syntax of Apache Spark. Most of the options and syntax are the same, although the OVH version has a few extra options related to infrastructure and cluster management. You simply request to run your code over your data on a cluster with a specific number of nodes, and the tool will create a cluster with that number of nodes, install all the packages and software (including Spark and its cluster management system), and then configure the network and firewall. After creating the cluster, OVH Analytics Data Compute will run your Spark code on it, return the result, and then delete the whole thing once it’s done. Much more efficient!

Let’s get it started… Feel the power!

The good news is that if you are already familiar with the spark-submit command line of Apache Spark, you don’t need to learn any new command line tools, as ovh-spark-submit uses almost exactly the same options and commands.

Let’s look at an example, where we’ll estimate the value of the famous number Pi, first with the original Apache Spark syntax, and then with the ovh-spark-submit client:

./spark-submit \
	--class org.apache.spark.examples.SparkPi \
	--total-executor-cores 20 \
	SparkPI.jar 100

./ovh-spark-submit \
	--class org.apache.spark.examples.SparkPi \
	--total-executor-cores 20 \
	SparkPI.jar 100

You can see that the only difference is the “ovh-” at the beginning of the command line, while the rest is the same. By running the ovh-spark-submit command, you will run the job over a cluster of computers with 20 cores instead of just your local computer. This cluster is fully dedicated to the job: it is created when you run the command, and deleted once the job is finished.

Another example is the popular word-count use case. Let’s assume you want to count the number of words in a big text file, using a cluster of 100 cores. The big text file is stored in OpenStack Swift storage (although it could be any online or cloud storage system). The Spark code for this calculation in Java looks like this:

// 'spark' is an existing SparkSession and SPACE a pre-compiled pattern, e.g. Pattern.compile(" ")
JavaRDD<String> lines = spark.read().textFile("swift://textfile.abc/novel.txt").javaRDD();

// Split each line into words, map each word to a (word, 1) pair, then sum the counts per word
JavaRDD<String> words = lines.flatMap(s -> Arrays.asList(SPACE.split(s)).iterator());
JavaPairRDD<String, Integer> ones = words.mapToPair(s -> new Tuple2<>(s, 1));
JavaPairRDD<String, Integer> counts = ones.reduceByKey((i1, i2) -> i1 + i2);
List<Tuple2<String, Integer>> output = counts.collect();

We can select the desired version of Spark as well. For this example, we’ve selected Spark version 2.4.0, and the command line for running this Spark job looks like this:

./ovh-spark-submit \
	--class JavaWordCount \
	--total-executor-cores 100 \
	--name wordcount1 \
	--version 2.4.0 \
	SparkWordCount-fat.jar 

To create our Spark cluster, we use nodes that have four vCores and 15GB of RAM. Therefore, by running this command, a cluster of 26 servers will be created (one for the master node and 25 for workers), so we will have 25×4=100 vCores and 25×15=375GB of RAM.

After running the command line, you will see the progress of creating the cluster and installing all the required software.

Once the cluster is created, you can take a look at it with the official Spark dashboard, and check if your cluster has all 25 workers up and running:

Also, if you go to the OpenStack Horizon dashboard in your OVH cloud account, you will see all 26 servers:

The Apache Spark job will be executed according to the Java code in the jar file that we sent to the Spark cluster, and the results will be shown on the screen. The results and the complete log files will also be saved both on the local computer and in the user’s Swift storage.

Once it’s done, you will see a message that the cluster has been deleted, along with the addresses of the logs in OpenStack Swift storage and on the local computer. You can see in the following screenshot that creating a fully installed and configured Spark cluster with 26 servers took less than five minutes.

A bit more about OVH Analytics Data Compute

If you are curious, here are some additional details about Analytics Data Compute:

  • Everything is built on the OVH Public Cloud, which means everything is powered by OpenStack.
  • You can choose the Apache Spark version you want to run, directly in the command line. You can also, of course, run multiple clusters with different versions.
  • A new dedicated cluster will be created for each request, and will be deleted after finishing the job. This means there are no security or privacy issues caused by having multiple users for a single cluster.
  • You have the option of keeping your cluster after finishing the job. If you add the keep-infra option to the command line, the cluster will not be deleted when you’re done. You can then send more jobs to that cluster or view more details from the logs.
  • Your cluster computers are created in your own OVH Public Cloud project, so you have full control of your cluster computers.
  • Results and output logs will be saved in Swift on your OVH Public Cloud project. Only you will have access to them, and you will also have the full history of all your Spark jobs saved in a folder, organised by date and time of execution.
  • Input and output data can come from any source or be in any format. There is no vendor lock-in when it comes to storage, so you are not forced to use OVH cloud storage: you can use any online or cloud storage platform accessible over the public internet, as sketched in the example after this list.
  • You can access your Cluster and Spark dashboards and web UIs via HTTPS.
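
As a rough sketch of that storage point, the same Spark code can read from several back ends just by changing the path – the paths below are hypothetical, and each scheme requires the matching Hadoop-compatible connector to be available on the cluster:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-example").getOrCreate()

# Hypothetical paths: only the URI scheme changes between storage back ends,
# provided the corresponding Hadoop-compatible connector is configured
lines_swift = spark.read.text("swift://textfile.abc/novel.txt")   # OpenStack Swift
lines_s3    = spark.read.text("s3a://my-bucket/novel.txt")        # any S3-compatible object store
lines_local = spark.read.text("file:///data/novel.txt")           # local or mounted filesystem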

Let’s focus on cluster management systems

In an Apache Spark cluster, independent processes called “executors” run on the cluster nodes and are coordinated by the driver program. To allocate the cluster’s resources across applications, the driver program must connect to a cluster management system, after which it sends application code and tasks to the executors.

There are several options when it comes to cluster management systems, but to keep things fast and simple, we selected the Spark standalone cluster management system. This gives our users the freedom to choose any version of Spark, and also makes cluster installation faster than the other options. If, for example, we had selected Kubernetes as our cluster management system, our users would have been limited to Spark versions 2.3 or above, and cluster installation would have been more time-consuming. Alternatively, if we had deployed on a ready-to-use Kubernetes cluster (like OVH Managed Kubernetes), we would have lost scalability, because the Apache Spark cluster would have been inherently limited by the infrastructure of the underlying Kubernetes cluster. With our current design, users can have an Apache Spark cluster with as many servers as they like, and the freedom to scale easily.
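
As a rough illustration of this driver-to-cluster-manager relationship, here is a minimal PySpark sketch of a driver connecting to a standalone master, which then allocates executors on the workers. The master URL and resource figures are illustrative assumptions, not values used by ovh-spark-submit.

from pyspark import SparkConf, SparkContext

# Illustrative settings: the master URL and resource figures are assumptions
conf = (SparkConf()
        .setAppName("standalone-demo")
        .setMaster("spark://master-node:7077")    # standalone cluster management system
        .set("spark.executor.memory", "4g")
        .set("spark.cores.max", "20"))

sc = SparkContext(conf=conf)
# The driver ships the application code and tasks to the executors started on the workers
print(sc.parallelize(range(1000), 20).sum())
sc.stop()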

Try it yourself!

To get started with Analytics Data Compute, you just need to create a cloud account at www.ovh.com, download the ovh-spark-submit software, and run it as described on the OVH documentation page. Also, if you participate in a short survey on our OVH Labs page, you will receive a voucher with 20 euros of free credit, which will let you test Analytics Data Compute first-hand.

If you have any questions or would like further explanation, our team is available through our Gitter channel.

DevOps @OVHcloud, Cloud and Data Engineer, Developer Evangelist