<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Big Data Archives - OVHcloud Blog</title>
	<atom:link href="https://blog.ovhcloud.com/tag/big-data/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.ovhcloud.com/tag/big-data/</link>
	<description>Innovation for Freedom</description>
	<lastBuildDate>Mon, 03 Jul 2023 07:55:45 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://blog.ovhcloud.com/wp-content/uploads/2019/07/cropped-cropped-nouveau-logo-ovh-rebranding-32x32.gif</url>
	<title>Big Data Archives - OVHcloud Blog</title>
	<link>https://blog.ovhcloud.com/tag/big-data/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>From 4 days to 15 minutes, a Domain Big Data story</title>
		<link>https://blog.ovhcloud.com/from-4-days-to-15-minutes-a-domain-big-data-story/</link>
		
		<dc:creator><![CDATA[Mathieu Cornic]]></dc:creator>
		<pubDate>Fri, 02 Dec 2022 14:27:31 +0000</pubDate>
				<category><![CDATA[OVHcloud Engineering]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Domain names]]></category>
		<guid isPermaLink="false">https://blog.ovhcloud.com/?p=21963</guid>

					<description><![CDATA[In our Domain names 101 series, we explained that one of ICANN&#8217;s missions is regulating and providing norms for gTLD domain names (.com, .net, .info &#8230;). One of these norms is the Registrar Data Escrow program, also known as RDE. To put it simply, registrars have to regularly gather and export some data to a third-party [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>In our <a href="https://blog.ovhcloud.com/domain-names-behind-the-scenes/" data-wpel-link="internal">Domain names 101 series</a>, we explained that one of <a href="https://www.icann.org/fr" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">ICANN</a>&#8217;s missions is regulating and providing norms for <a href="https://en.wikipedia.org/wiki/Generic_top-level_domain" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">gTLD</a> domain names (.com, .net, .info &#8230;). One of these norms is the <a href="https://www.icann.org/resources/pages/registrar-data-escrow-2015-12-01-en" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Registrar Data Escrow program</a>, also known as RDE.</p>



<p>To put it simply, registrars have to regularly gather and export some data to a third-party company. Since the creation of this program, we had used a solution that worked very well for many years, but it started to face limitations due to the increasing number of domain name registrations. Indeed, generating one of these weekly exports took 4 days! It was time to design a long-term, scalable solution.</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/11/title-4-days-to-15mins-1-1024x546.jpg" alt="From 4 days to 15 minutes, a Domain Big Data story" class="wp-image-24143" width="800" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/11/title-4-days-to-15mins-1-1024x546.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/11/title-4-days-to-15mins-1-300x160.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/11/title-4-days-to-15mins-1-768x410.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/11/title-4-days-to-15mins-1.jpg 1162w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p>Let&#8217;s see together how we went from 4 days of report generation to 15 minutes &#8211; a drastic performance improvement of about +38,000%.</p>



<h3 class="wp-block-heading" id="what-is-the-registrar-data-escrow-program">What is the Registrar Data Escrow program?</h3>



<p>As an ICANN-accredited registrar, we have to regularly send a copy of our gTLD domain name data to an escrow agent, a neutral third party. This is a protection mechanism in case of registrar failure, accreditation termination, or accreditation lapse without renewal, so that the data associated with registered domain names is never at risk of being lost or inaccessible.</p>



<p>In concrete terms, for all gTLD domain names that we manage for our customers, we send encrypted files containing:</p>



<ul class="wp-block-list">
<li>The domain name</li>



<li>The names of the primary and secondary name servers</li>



<li>The expiration date of the domain name</li>



<li>Identity information of the domain name contacts (nichandles): the registrant, the technical contact, the administrative contact and the billing contact (person or company name, postal address, e-mail address and phone/fax number)</li>
</ul>



<p>Authorized escrow agents are listed on the <a href="https://www.icann.org/resources/pages/registrar-data-escrow-2015-12-01-en" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">ICANN website</a>. We chose <a href="https://www.denic.de" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Denic Escrow Services</a>, a European actor, to ensure our compliance with GDPR.</p>



<p>Registrars with at least 400,000 registrations per year are required to deposit a full export once a week, and an incremental export on the remaining days. For the others, a weekly full export is sufficient. You can find all the technical details in the <a href="https://www.icann.org/en/system/files/files/rde-specs-09nov07-en.pdf" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Registrar Data Escrow Specifications</a>.</p>
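<p>As a toy illustration of that schedule rule (a sketch of the logic described above, not ICANN&#8217;s wording &#8211; the function name and calling convention are made up):</p>

```python
# Hypothetical sketch of the RDE deposit schedule: registrars with at least
# 400,000 yearly registrations send a weekly full export plus incremental
# exports on the other days; smaller registrars only send the weekly full.
def export_plan(yearly_registrations, weekday):
    """Return which export (if any) is due, with weekday 0 = full-deposit day."""
    if weekday == 0:
        return "full"                # everyone sends the weekly full export
    if yearly_registrations >= 400_000:
        return "incremental"         # large registrars also send daily deltas
    return None                      # smaller registrars are done for the week
```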



<p>Basically, a report is composed of two types of encrypted CSV files: Domains &amp; Handles (Contacts).</p>



<pre class="wp-block-preformatted"># Domain CSV file
domain,nameserver-1,nameserver-2,expiration-date,registrant-contact,technical-contact,admin-contact,billing-contact
ovhcloud.com,dns.ovh.net,ns.ovh.net,2049-08-01T00:00:00+00:00,EU_contact/123456,EU_xx12345-ovh,EU_xx12345-ovh,EU_xx12345-ovh

# Handle CSV file
handle,name,address-street1,address-street2,address-street3,address-city,address-state,address-postcode,country-name,email,phone,fax
EU_contact/123456,OVH SAS (Klaba Miroslaw),2 Rue Kellermann,"","",ROUBAIX,"",59100,FR,ovhcloud@fake-email.com,+33.123456789,""
EU_xx12345-ovh,OVH SAS (Klaba Miroslaw),2 Rue Kellermann,"","",ROUBAIX,"",59100,FR,ovhcloud@fake-email.com,+33.123456789,""</pre>



<h3 class="wp-block-heading" id="why-4-days-of-generation">Why 4 days of generation?</h3>



<p>Domain names have been for sale on <a href="https://www.ovhcloud.com" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud</a> since 2000. This ICANN program has been mandatory in the Registrar Accreditation Agreement since 1999, and the latest official&nbsp;<a href="https://www.icann.org/en/system/files/files/rde-specs-09nov07-en.pdf" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Registrar Data Escrow Specification</a> document was published in 2009. This is an old requirement, implemented in the early days when our Information System was a full monolith with only a few domain names in our databases.</p>



<p>At that time, the report generation was written as a single Perl script performing simple database queries. Over the years, with increasing volume and traffic, new teams, new databases, new subsidiaries and new architectures were created. Data responsibility was split between different teams, such as Domain, DNS and Nichandle/Contact. The internal workings of the report generation did not change, though. Without fundamental refactoring, the time needed to generate the report inherently increased over the years.</p>



<p>This was no big deal as long as it worked well and took no more than 7 days, so that weekly exports did not overlap.</p>



<p>But in the last 2 years, we went from 2.5 days to 4 days. In this situation, if generation failed on the last day for any reason, we would be unable to generate and send a new export within the one-week threshold. In the meantime, we have also reached a point where we manage about 5 million domain names, and we intend to keep growing. We had to anticipate and find a long-term, scalable solution, capable of handling 10 times our current number of domain names.</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/11/evolution-generation-duration-1024x354.jpg" alt="Evolution of the time for generation of exports" class="wp-image-24149" width="800" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/11/evolution-generation-duration-1024x354.jpg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/11/evolution-generation-duration-300x104.jpg 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/11/evolution-generation-duration-768x265.jpg 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/11/evolution-generation-duration.jpg 1260w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<h3 class="wp-block-heading" id="the-internal-data-lake">The internal Data Lake</h3>



<p>In the meantime, over the years, <a href="https://www.ovhcloud.com" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud</a> built an internal Data Lake, a centralized place where internal production data is replicated. It&#8217;s meant to be a reliable, secure and efficient platform for storing Big Data.</p>



<p>Then, after agreement with our internal data governance team, authorized teams from across <a href="https://www.ovhcloud.com" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud</a> may use this data for data science. For example, aggregation tables can be used by business teams to make decisions or produce legal documents.</p>



<p><a href="https://www.ovhcloud.com" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud</a>&#8217;s current Data Lake is based on <a href="https://fr.cloudera.com/products/hdp.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Hortonworks HDP</a>. On top of this Data Lake, Big Data tools, like <a href="https://spark.apache.org/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Spark</a>, are configured for high performance. Every day, these tools compute all business data to produce smart and practical aggregation tables.</p>



<p>In this context, as the <a href="https://www.ovhcloud.com/fr/domains/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Domain team</a>, we had the perfect replacement platform. Our Data Lake colleagues provide this efficient platform and a set of ready-to-use tools that we can consume as a service to solve the Registrar Data Escrow problem.</p>



<figure class="wp-block-image aligncenter size-large is-resized"><img fetchpriority="high" decoding="async" src="https://blog.ovhcloud.com/wp-content/uploads/2022/11/RDE_public_database_to_datalake-817x1024.png" alt="RDE public database to datalake" class="wp-image-24153" width="632" height="813"/></figure>



<h3 class="wp-block-heading" id="the-solution-replacement">The solution replacement</h3>



<p>As previously said, the data required for the report comes from multiple teams and multi-region databases. As replications are centralized in the Data Lake, we have a single source that we can query, join and normalize easily. Moreover, as we no longer query the real-time production databases, we reduce the impact on production performance.</p>
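<p>To picture the kind of join and normalization involved, here is a deliberately simplified sketch in plain Python (the real job runs as a distributed Spark job over Data Lake tables; every field and table name below is invented for illustration):</p>

```python
# Hypothetical sketch of the report's join step: domain rows are joined with
# their registrant contact rows, then flattened into the CSV-style schema
# shown earlier. Plain dicts stand in for the replicated Data Lake tables.
domains = [
    {"domain": "example.com", "ns1": "dns.ovh.net", "ns2": "ns.ovh.net",
     "expires": "2049-08-01T00:00:00+00:00", "registrant": "EU_contact/123456"},
]
contacts = {
    "EU_contact/123456": {"name": "OVH SAS", "city": "ROUBAIX", "country": "FR"},
}

def build_report_rows(domains, contacts):
    """Join each domain with its registrant contact; missing fields become ''."""
    rows = []
    for d in domains:
        c = contacts.get(d["registrant"], {})
        rows.append([d["domain"], d["ns1"], d["ns2"], d["expires"],
                     d["registrant"], c.get("name", ""), c.get("city", ""),
                     c.get("country", "")])
    return rows
```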



<p>The Data Lake team provides an <a href="https://airflow.apache.org/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Apache Airflow</a> platform where we can schedule <a href="https://spark.apache.org/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Apache Spark</a> scripts through <a href="https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Directed Acyclic Graph</a> workflows. Let&#8217;s break that down, bit by bit:</p>



<ol class="wp-block-list">
<li>The <em>Apache Spark</em> script is the piece of code that runs on the <em>Spark</em> server, performing all the joining and normalization of input data. Execution is massively distributed on <em>Hadoop YARN</em>, so that data is computed and fetched in parallel, ensuring a short and near-constant execution time. <strong>This is the central piece of the performance improvement.</strong></li>



<li>The script execution is included in a <em>Directed Acyclic Graph</em> workflow, alongside other steps such as concatenating result files, removing the working directory and uploading the final report to the <em>Hadoop Distributed File System</em>.</li>



<li><em>Apache Airflow</em> runs the DAG on its schedule. It also provides an administration UI for manually running jobs, displaying logs and so on.</li>
</ol>
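<p>Put together, the workflow chains a handful of steps in dependency order. Here is a minimal, hypothetical sketch of that orchestration logic, with plain Python functions standing in for the real Spark and HDFS operators (all names are invented):</p>

```python
# Hypothetical sketch of the DAG's step ordering: generate the report with
# Spark, concatenate the part files, upload to HDFS, then clean up.
def run_spark_generation(ctx):
    ctx["parts"] = ["part-0.csv", "part-1.csv"]   # stand-in for the Spark job

def concatenate_parts(ctx):
    ctx["report"] = "+".join(ctx["parts"])        # stand-in for file concatenation

def upload_to_hdfs(ctx):
    ctx["uploaded"] = ctx["report"]               # stand-in for the HDFS upload

def cleanup_workdir(ctx):
    ctx.pop("parts", None)                        # stand-in for directory removal

def run_dag(steps):
    """Run steps in order, the way Airflow would execute a linear DAG."""
    ctx = {}
    for step in steps:
        step(ctx)
    return ctx

ctx = run_dag([run_spark_generation, concatenate_parts,
               upload_to_hdfs, cleanup_workdir])
```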



<p>Finally, the concatenated report is fetched from HDFS, syntactically and semantically validated, encrypted and sent to our escrow agent.</p>



<figure class="wp-block-image aligncenter size-large"><img decoding="async" width="1024" height="983" src="https://blog.ovhcloud.com/wp-content/uploads/2022/11/functional-architecture-4-1024x983.png" alt="Functional architecture" class="wp-image-24176" srcset="https://blog.ovhcloud.com/wp-content/uploads/2022/11/functional-architecture-4-1024x983.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2022/11/functional-architecture-4-300x288.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2022/11/functional-architecture-4-768x737.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2022/11/functional-architecture-4-1536x1475.png 1536w, https://blog.ovhcloud.com/wp-content/uploads/2022/11/functional-architecture-4-2048x1967.png 2048w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<h3 class="wp-block-heading" id="conclusion">Conclusion</h3>



<p>We can clearly see the performance improvement of the full report generation in the following table.</p>






<figure class="wp-block-table"><table><tbody><tr><td></td><td class="has-text-align-left" data-align="left"><strong>Before</strong></td><td class="has-text-align-left" data-align="left"><strong>After</strong></td></tr><tr><th>Generation</th><td class="has-text-align-left" data-align="left">5600 minutes</td><td class="has-text-align-left" data-align="left">13 minutes</td></tr><tr><th>Validation &amp; Encryption</th><td class="has-text-align-left" data-align="left">1 minute</td><td class="has-text-align-left" data-align="left">1 minute</td></tr><tr><th>Upload</th><td class="has-text-align-left" data-align="left">15 seconds</td><td class="has-text-align-left" data-align="left">15 seconds</td></tr><tr><th>TOTAL</th><th class="has-text-align-left" data-align="left">≈4 days</th><th class="has-text-align-left" data-align="left">≈15 minutes</th></tr></tbody></table></figure>



<p>Even though it was pure 20-year-old legacy, the report generation worked seamlessly for many years and did the job without us noticing. This is important to say: we should keep normalizing waiting for the right moment to refactor something that works, even when the code looks bad.</p>



<p>Due to&nbsp;increasingly poor performance, it was now time to change the paradigm and find the right long-term, efficient solution. We chose to go with Big Data technologies to build something ready to face the future. As it&#8217;s distributed, report generation should take a near-constant execution time even as volume increases. This solution should, in turn, work seamlessly for many years.</p>



<p>On the technology side, we can also say that Spark works really well. By the way, if you did not know, <a href="https://www.ovhcloud.com" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud</a> provides a ready-to-use <a href="https://www.ovhcloud.com/en/public-cloud/data-processing/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Data Processing platform</a> which uses Spark clusters.</p>



<p>Finally, as you can see, this article is an overview of what we built together with the <a href="https://www.ovhcloud.com" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">OVHcloud</a> Data Lake team. We did not cover the Data Lake technologies in detail, though. For more in-depth technical content about Spark, take a look at the <a href="https://blog.ovhcloud.com/how-to-run-massive-data-operations-faster-than-ever-powered-by-apache-spark-and-ovh-analytics-data-compute/" data-wpel-link="internal">&#8220;How to run massive data operations faster than ever, powered by Apache Spark and OVHcloud Analytics Data Compute&#8221;</a> and <a href="https://blog.ovhcloud.com/improving-the-quality-of-data-with-apache-spark/" data-wpel-link="internal">&#8220;Improving the quality of data with Apache Spark&#8221;</a> articles&nbsp;🙂</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Why are you still managing your data processing clusters?</title>
		<link>https://blog.ovhcloud.com/why-are-you-still-managing-your-data-processing-clusters/</link>
		
		<dc:creator><![CDATA[Mojtaba Imani]]></dc:creator>
		<pubDate>Wed, 30 Sep 2020 16:14:35 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Apache Spark]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Cluster Computing]]></category>
		<category><![CDATA[Data Platform]]></category>
		<category><![CDATA[Data Processing]]></category>
		<category><![CDATA[Distributed Computing]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=19241</guid>

					<description><![CDATA[Cluster computing is used to share a computation load among a group of computers, achieving a higher level of performance and scalability. Apache Spark is an open-source, distributed cluster-computing framework that is much faster than its predecessor (Hadoop MapReduce), thanks to features like in-memory processing and lazy evaluation. Apache Spark [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>Cluster computing is used to share a computation load among a group of computers, achieving a higher level of performance and scalability.</p>



<p>Apache Spark is an open-source, distributed cluster-computing framework that is much faster than its predecessor (Hadoop MapReduce), thanks to features like in-memory processing and lazy evaluation. It is the most popular tool in this category.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="539" src="https://www.ovh.com/blog/wp-content/uploads/2020/09/IMG_0282-1024x539.png" alt="Why are you still managing your data processing clusters?" class="wp-image-19364" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0282-1024x539.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0282-300x158.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0282-768x404.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/09/IMG_0282.png 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>The analytics engine is the leading framework for large-scale SQL, batch processing, stream processing and machine learning. For coding in Spark, you can use several programming languages, including Java, Scala, Python, R and SQL. It can run locally on a single machine, or on a cluster of computers for task distribution.</p>



<p>With Apache Spark, you can process your data on your local computer, or you can create a cluster and send it any number of processing jobs.</p>



<p>It is possible to create your cluster with physical computers on-premises, with virtual machines in a hosting company, or with any cloud provider. With your own cluster, you’ll have the ability to send Spark jobs whenever you like.&nbsp;</p>



<h3 class="wp-block-heading">Cluster Management Challenges&nbsp;</h3>



<p>If you are processing a huge amount of data and you expect to have results in a reasonable time, your local computer won’t be enough. You need a cluster of computers to divide the data and process workloads &#8211; several computers are run in parallel to speed up the task.&nbsp;</p>



<p>Creating and managing your own cluster of computers, however, is not an easy task. You will face several challenges:&nbsp;</p>



<h4 class="wp-block-heading">Cluster Creation&nbsp;</h4>



<p>Creating an Apache Spark cluster is an arduous task.&nbsp;</p>



<p>First, you’ll need to create a cluster of computers and install an operating system, development tools (Python, Java, Scala), etc.&nbsp;</p>



<p>Second, you’ll need to select a version of Apache Spark and install the necessary nodes (master&nbsp;and workers).&nbsp;</p>



<p>Lastly, you’ll need to connect all these nodes together to finalize your&nbsp;Apache Spark cluster. </p>



<p>All in all, it can take several hours to create and configure a new Apache Spark cluster.</p>



<h4 class="wp-block-heading">Cluster Management&nbsp;</h4>



<p>But once you have your own cluster up and running, your job is far from over. Is your cluster working well? Is each and every node healthy?&nbsp;</p>



<p>Here is the second challenge: facing the pain of cluster management!</p>



<p>You’ll need to check the health of all your nodes manually or, preferably, install monitoring tools that report any issues nodes may encounter. </p>



<p>Do the nodes have enough disk space available for new tasks? One key issue faced by Apache Spark clusters is that some tasks write a lot of data to the local disks of nodes without ever deleting it. Disk space is a common problem and, as you may know, a lack of disk space prevents new tasks from running.</p>



<p>Do you need to run multiple Spark jobs at the same time? Sometimes a single job occupies all the CPU and RAM resources in your cluster and doesn’t allow other jobs to start and run at the same time.&nbsp;</p>



<p>These are only a few of the problems you will meet while working with your own clusters.</p>



<h4 class="wp-block-heading">Cluster Security</h4>



<p>Now for the third challenge! What is even more important than having a cluster up and running smoothly?</p>



<p>You guessed it: security. After all, Apache Spark is a Data Processing tool. And data is very sensitive.</p>



<p>Where in your cluster does security matter most?</p>



<p>What about the connection between nodes? Are they connected with a secured (and fast) connection? Who has access to the servers housing your cluster?&nbsp;</p>



<p>If you have created your cluster on the cloud and you are working with sensitive data, you’ll need to address these issues by securing each and every node and encrypting communications between them.&nbsp;</p>



<h4 class="wp-block-heading">Spark Version</h4>



<p>Here is your fourth challenge: managing your cluster users&#8217; expectations. In some cases this may be a less daunting task, but not all.</p>



<p>There isn&#8217;t a whole lot you can do to change the expectations of your cluster&#8217;s users, but here&#8217;s a common example to help you prepare:</p>



<p>Do your users like to test their code with different versions of Apache Spark? Or do they require the latest feature from the latest Spark nightly build?</p>



<p>When you create an Apache Spark cluster, you have to select one version of Spark. Your whole cluster will be bound to it, and <em>it</em> alone. This means it isn&#8217;t possible for several versions to coexist in the same cluster.</p>



<p>So either you&#8217;ll have to change the Spark version of your whole cluster, or create another, separate cluster. And of course, if you decide to do that, you&#8217;ll have to schedule downtime on your cluster to make the modifications.</p>



<h4 class="wp-block-heading">Cluster Efficiency&nbsp;</h4>



<p>And for the final challenge: scaling!</p>



<p>How can you get the most benefit from the cluster resources you are paying for? Are you paying for your cluster but feel you&#8217;re not using it efficiently? Is your cluster too big for your users? Is it running, but empty of jobs during the holiday seasons?</p>



<p>When you have a processing cluster &#8211; especially one with a lot of precious resources that you&#8217;re paying for &#8211; you will always have one major concern: is your cluster being utilised as efficiently as possible? There will be occasions when some resources in your cluster are idle, or when you are only running small jobs that don&#8217;t require all the resources in your cluster. Scaling will become a major pain point.</p>



<h3 class="wp-block-heading">OVHcloud Data Processing (ODP) Solution&nbsp;</h3>



<p>At OVHcloud, we created a new data service called OVHcloud Data Processing (ODP) to address all cluster management challenges mentioned above.&nbsp;</p>



<p>Let’s assume that you have some data to process but you don’t have the desire, the time, the budget or the skills to overcome these challenges. Maybe you don’t want to, or can’t, ask for help from colleagues or consultants to spawn and manage a cluster. How can you still make use of Apache Spark? This is where the ODP service comes in!</p>



<p>With ODP, you just write your Apache Spark code and ODP will do the rest. It will create a disposable, dedicated Apache Spark cluster in the cloud for each job in just a few seconds &#8211; then delete the whole cluster once the job is finished. You only pay for the requested resources, and only for the duration of the computation. There is no need to pay for hours and hours of cloud servers while you are busy with cluster installation, configuration, or even debugging and updating the engine version.&nbsp;</p>
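<p>The per-job model can be pictured with a small sketch. This is hypothetical pseudo-logic to illustrate the idea &#8211; it is not the ODP API, and every name below is invented: resources exist only for the lifetime of one job, and you are billed for that window alone.</p>

```python
import time

def run_disposable_job(job_fn, workers, cpus_per_worker):
    """Hypothetical sketch of a disposable per-job cluster: provision dedicated
    resources, run the job, tear everything down, bill only for the duration."""
    start = time.monotonic()
    cluster = {"workers": workers, "cpus": cpus_per_worker}  # provisioned on demand
    try:
        result = job_fn(cluster)
    finally:
        cluster.clear()              # the whole cluster is deleted after the job
    billed_seconds = time.monotonic() - start
    return result, billed_seconds

# Two concurrent jobs never share resources: each call provisions its own cluster.
result, billed = run_disposable_job(lambda c: c["workers"] * c["cpus"], 4, 8)
```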



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/EF2587E4-9E30-4C3B-A6E1-0EAF3FF8F8F6-1024x813.jpeg" alt="OVHcloud Data Processing" class="wp-image-18856" width="512" height="407" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/EF2587E4-9E30-4C3B-A6E1-0EAF3FF8F8F6-1024x813.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/EF2587E4-9E30-4C3B-A6E1-0EAF3FF8F8F6-300x238.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/EF2587E4-9E30-4C3B-A6E1-0EAF3FF8F8F6-768x610.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/EF2587E4-9E30-4C3B-A6E1-0EAF3FF8F8F6.jpeg 1227w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<h4 class="wp-block-heading">ODP Cluster Creation</h4>



<p>When you submit your job, ODP will create an Apache Spark cluster dedicated to that job in just a few seconds. This cluster will have the amount of CPU and RAM, and the number of workers, specified in the job submission form. All the necessary software will be installed automatically. You don&#8217;t need to worry about the cluster at all &#8211; how to install, configure or secure it. ODP does all of this for you.</p>



<h4 class="wp-block-heading">ODP Cluster Management&nbsp;</h4>



<p>When you submit your job, cluster management and monitoring are configured and handled by ODP. All logging and monitoring mechanisms and tools will be installed automatically for you. You will have a Grafana dashboard to monitor different parameters and resources of your job and you will have access to the official Apache Spark dashboard.&nbsp;</p>



<p>You don’t need to worry about cleaning the local disk of each node because each job will start with fresh resources. It isn&#8217;t possible, therefore, for one job to delay another job as each job has new, dedicated resources.&nbsp;&nbsp;</p>



<h4 class="wp-block-heading">ODP Cluster Security&nbsp;</h4>



<p>ODP will take care of the security and privacy of your cluster as well. Firstly, all communications between the Spark nodes are encrypted. Secondly, none of your job&#8217;s nodes are accessible from the outside. ODP only opens a limited set of ports for your cluster, so that you are still able to load or push your data.</p>



<h4 class="wp-block-heading">ODP Cluster Spark Version&nbsp;</h4>



<p>When it comes to using multiple Spark versions, ODP offers a solution. As every job possesses its own dedicated resources, each job can use any version currently supported by the service, independently from any other job running at the same time. When submitting an Apache Spark job through ODP, you first select the version of Apache Spark you would like to use. When the Apache Spark community releases a new version, it will soon become available in ODP, and you can then submit another job with the new Spark version. This means you no longer need to keep updating the Spark version of your whole cluster.</p>



<h4 class="wp-block-heading">ODP Cluster Efficiency</h4>



<p>Each time you submit a job, you define exactly how many resources and workers you would like to use for that job. As said earlier, each job has its own dedicated resources, so you can have small jobs running alongside much bigger ones. This flexibility means you will never have to worry about having an idle cluster. You pay for the resources you use, when you use them.</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/05/CD83F4F3-4858-414A-A50E-362477780385-1024x662.png" alt="OVHcloud Data Processing" class="wp-image-15537" width="768" height="497" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/05/CD83F4F3-4858-414A-A50E-362477780385-1024x662.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/CD83F4F3-4858-414A-A50E-362477780385-300x194.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/CD83F4F3-4858-414A-A50E-362477780385-768x496.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/CD83F4F3-4858-414A-A50E-362477780385-1200x775.png 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/CD83F4F3-4858-414A-A50E-362477780385.png 2048w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure></div>



<h3 class="wp-block-heading">How to start?&nbsp;</h3>



<p>If you&#8217;re interested in trying ODP, you can check out: <a href="https://www.ovhcloud.com/en/public-cloud/data-processing/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://www.ovhcloud.com/en/public-cloud/data-processing/</a> or you can easily create an account at <a href="http://www.ovhcloud.com" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">www.ovhcloud.com</a> and select “Data Processing” in the public cloud section. You can also ask questions directly to the product team in the ODP public Gitter channel <a href="https://gitter.im/ovh/data-processing" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://gitter.im/ovh/data-processing</a>.</p>



<h3 class="wp-block-heading">Conclusion&nbsp;</h3>



<p>With ODP, the challenges of running an Apache Spark cluster are removed, or at least alleviated (we still can’t do much about users’ expectations!). You don’t have to worry about lacking the resources necessary to process your data, or about the need to create, install and manage your own cluster. </p>



<p>Focus on your processing algorithm and ODP will do the rest.</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fwhy-are-you-still-managing-your-data-processing-clusters%2F&amp;action_name=Why%20are%20you%20still%20managing%20your%20data%20processing%20clusters%3F&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Do you need to process your data? Try the new OVHcloud Data Processing service!</title>
		<link>https://blog.ovhcloud.com/try-the-new-ovhcloud-data-processing-service/</link>
		
		<dc:creator><![CDATA[Mojtaba Imani]]></dc:creator>
		<pubDate>Wed, 22 Jul 2020 12:55:45 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Apache Spark]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Cluster Computing]]></category>
		<category><![CDATA[Data Platform]]></category>
		<category><![CDATA[Data Processing]]></category>
		<category><![CDATA[Distributed Computing]]></category>
		<category><![CDATA[OVHcloud]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=18748</guid>

					<description><![CDATA[One of the data services of OVHcloud is called OVHcloud Data Processing (ODP). It is a service that allows you to submit a processing job without caring about the cluster behind it. You just have to specify the ressources you want to use for your job, and the service will abstract the cluster creation, and destroy it for you as soon as your job is finished. In other words, you don’t have to think about clusters any more. Decide how much resources you need to process your data in the most efficient way for you and let OVHcloud Data Processing do the rest.<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Ftry-the-new-ovhcloud-data-processing-service%2F&amp;action_name=Do%20you%20need%20to%20process%20your%20data%3F%20Try%20the%20new%20OVHcloud%20Data%20Processing%20service%21&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p>Today, we are generating more data than ever. 90 percent of the world’s data has been generated in the last two years, and by 2025 the amount of data in the world is estimated to reach 175 zettabytes. People write around 500 million tweets per day, and autonomous cars generate 20TB of data every hour. By 2025, more than 75 billion IoT devices will be connected to the Internet, all generating data. Nowadays, devices and services that generate data are everywhere.</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/DAF95617-1C6B-49FC-A2CD-8346B1E6B9AC-1024x539.jpeg" alt="Do you need to process your data? Try the new OVHcloud Data Processing cloud service!" class="wp-image-18852" width="512" height="270" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/DAF95617-1C6B-49FC-A2CD-8346B1E6B9AC-1024x539.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/DAF95617-1C6B-49FC-A2CD-8346B1E6B9AC-300x158.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/DAF95617-1C6B-49FC-A2CD-8346B1E6B9AC-768x404.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/DAF95617-1C6B-49FC-A2CD-8346B1E6B9AC.jpeg 1200w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<p>There is also the notion of data exhaust, the by-product of people&#8217;s online activities: the data generated when someone visits a website, buys a product or searches for something with a search engine. You may have heard this data described as metadata.</p>



<p>We will start drowning in a flood of data unless we learn how to swim &#8211; that is, how to benefit from vast amounts of data. To do this, we need to be able to process the data: for better decision-making, preventing fraud and danger, inventing better products, or even predicting the future. The possibilities are endless.</p>



<p>But how can we process this huge amount of data? It’s certainly not possible the old-fashioned way: we need to upgrade our methods and equipment.</p>



<p>Big data refers to data sets whose volume, velocity and variety are too great to be processed by a single local computer. So what are the requirements for processing “big data”? </p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/7127B735-8372-4063-97DF-84BF3172E49A.jpeg" alt="Data, data everywhere!" class="wp-image-18850" width="436" height="368" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/7127B735-8372-4063-97DF-84BF3172E49A.jpeg 872w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/7127B735-8372-4063-97DF-84BF3172E49A-300x253.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/7127B735-8372-4063-97DF-84BF3172E49A-768x648.jpeg 768w" sizes="auto, (max-width: 436px) 100vw, 436px" /></figure></div>



<h3 class="wp-block-heading">1- Process data in parallel</h3>



<p>Data is everywhere, and available in huge quantities. First off, let&#8217;s apply the old rule: &#8216;divide and conquer&#8217;.&nbsp;</p>



<p>Dividing the data means distributing both the data and the processing tasks across several computers. These computers will need to be set up in a cluster, so the different tasks can run in parallel and deliver a reasonable boost in performance and speed.</p>



<p>Let&#8217;s assume you needed to find out what&#8217;s trending on Twitter within the hour. You would have to process around 500 million tweets with one computer. Not so easy now, is it? And what would be the benefit if it took a month to process? What is the value in finding the trend of the day, one month later?</p>



<p>Parallelization is more than a “nice to have” feature. It’s a requirement!</p>



<h3 class="wp-block-heading">2- Process data in the cloud</h3>



<p>The second step is to create and manage these clusters in an efficient way.&nbsp;</p>



<p>You have several choices here, such as building clusters with your own servers and managing them yourself. But that’s time consuming, costs quite a lot, and lacks some features you may wish to use, like flexibility. For these reasons, the cloud looks like a better and better solution every day for a lot of companies.</p>



<p>The elasticity that cloud solutions provide helps companies stay flexible and adapt their infrastructure to their needs. For data processing, for example, we need to be able to scale our computing cluster up and down easily, adapting the computing power to the volume of data we want to process according to our constraints (time, costs, etc.).&nbsp;</p>



<p>And then, even once you decide to use a cloud provider, you will have several solutions to choose from, each with its own drawbacks. One of them is to create a computing cluster on dedicated servers or public cloud instances and send the different processing jobs to that cluster. The main drawback of this solution is that when no processing is being done, you’re still paying for the reserved but unused processing resources.&nbsp;</p>



<p>A more efficient way is therefore to create a dedicated cluster for each processing job, with the right resources for that job, and delete the cluster afterwards. Each new job would have its own cluster, sized as needed, spawned on demand. But this solution is only feasible if creating a computing cluster takes just a few seconds, not minutes or hours.</p>



<h4 class="wp-block-heading">Data locality</h4>



<p>When creating a processing cluster, it is also worth considering data locality. Cloud providers usually offer several regions, spread across data centers situated in different countries. This has two main benefits: </p>



<p>The first is not directly linked to data locality but is more of a legal point. Depending on where your customers are, and where your data is, you may need to comply with local data privacy laws and regulations, which can mean keeping your data in a specific region or country and not processing it outside. Being able to create a cluster of computers in that region makes it easier to process data while complying with local privacy policies.</p>



<p>The second benefit is, of course, the ability to create your processing clusters in close physical proximity to your data. According to estimates, by 2025 almost 50 percent of the world’s data will be stored in the cloud, while on-premises data storage is in decline. </p>



<p class="has-text-align-center"><img loading="lazy" decoding="async" src="https://lh5.googleusercontent.com/xLg8hqGdVDbuf4AbWODV95LN9DInueit-Lqv6xxjz61lGKJ4ukK7noRAcpiLKrm5To8-ztAstlFqfcq1qZqMerZwg1wYjMcxtQmXvqpwZzVemn9vz1lPowxb8h3CA7O0ug0C_GQa" width="624" height="360"></p>



<p>Therefore, using a cloud provider with several regions gives companies the benefit of having processing clusters near their data&#8217;s physical location &#8211; greatly reducing the time (and costs!) it takes to fetch data.</p>



<p>While processing your data in the cloud may not be a requirement per se, it&#8217;s certainly more beneficial than doing it yourself.</p>



<h3 class="wp-block-heading">3- Process data with the most appropriate distributed computing technology</h3>



<p>The third and final step is to decide how you are going to process your data &#8211; that is, with which tools. Again, you could do it yourself, by implementing a distributed processing engine in a language of your choice. But where&#8217;s the fun in that? (Okay, for some of us it might actually be quite fun!)</p>



<p>But it would be astronomically complex. You would need to write code to divide the data into several parts and send each part to a computer in your cluster. Each computer would then process its part of the data, and you would need a way to retrieve the results from each one and re-aggregate everything into a coherent result. In short, it would be a lot of work, with a lot of debugging.&nbsp;</p>
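<p>Those steps can be sketched in a deliberately tiny, plain-Python example &#8211; split the data, count words in each chunk in a separate process, then merge the partial results. A real engine adds scheduling, shuffles and fault tolerance on top of this:</p>

```python
# A toy "do it yourself" distributed word count, standard library only.
from collections import Counter
from multiprocessing import Pool

def count_words(chunk):
    """Process one partition of the data (this runs on a worker process)."""
    return Counter(word for line in chunk for word in line.split())

def parallel_word_count(lines, workers=2):
    # 1. Divide the data into roughly equal parts.
    chunks = [lines[i::workers] for i in range(workers)]
    # 2. Send each part to a worker and process in parallel.
    with Pool(workers) as pool:
        partials = pool.map(count_words, chunks)
    # 3. Retrieve and re-aggregate the partial results.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

if __name__ == "__main__":
    tweets = ["spark is fast", "spark scales", "data data data"]
    print(parallel_word_count(tweets).most_common(2))
```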



<h4 class="wp-block-heading">Apache Spark</h4>



<p>But there are technologies that have been developed specifically for this purpose: they distribute the data and processing tasks automatically, and retrieve the results for you. Currently, the most popular distributed computing technology, especially for Data Science workloads, is Apache Spark.</p>



<p>Apache Spark is an open-source, distributed, cluster-computing framework. It is much faster than its predecessor, Hadoop MapReduce, thanks to features like in-memory processing and lazy evaluation.</p>



<p>Apache Spark is the leading platform for large-scale SQL, batch processing, stream processing and machine learning. For coding in Apache Spark, you have the option of using different programming languages (including Java, Scala, Python, R and SQL). It can run locally in a single machine or in a cluster of computers to distribute its tasks.</p>



<figure class="wp-block-image"><img decoding="async" src="https://lh5.googleusercontent.com/tHvP_t-ZFlfnCy00xVoj6hu9o_SPWa3b2Wl9O-CdRKXOw8ZF3fDqr9BnDppvAfgPu34wi_eRg4mRnBYbCcvdqK9SxHaV7vAh_stEbp4yuxIUpnujmf24gA3tADPdUNxjdUXSoyUs" alt=""/></figure>



<p>As you can see in the Google Trends chart above, there are alternatives, but Apache Spark has definitely established itself as the leader in distributed computing tools.</p>



<h3 class="wp-block-heading">OVHcloud Data Processing (ODP)</h3>



<p>OVHcloud is the leading European hosting and cloud provider, with a wide range of cloud services such as public and private cloud, managed Kubernetes and cloud storage. Besides hosting and cloud services, OVHcloud also provides a range of big data analytics and artificial intelligence services and platforms. </p>



<p>One of the data services offered by OVHcloud is OVHcloud Data Processing (ODP). It allows you to submit a processing job without worrying about the cluster behind it. You just specify the resources you need for the job, and the service abstracts away the cluster creation and destroys the cluster for you as soon as your job finishes. In other words, you don’t have to think about clusters any more. Decide how many resources you need to process your data in an efficient way, and let OVHcloud Data Processing do the rest.&nbsp;</p>



<h4 class="wp-block-heading">On-demand, job-specific Spark clusters</h4>



<p>The service will deploy a temporary, job-specific Apache Spark Cluster, then configure and secure it automatically. You don’t need to have any prior knowledge or skills related to the cloud, networking, cluster management systems, security, etc. You only have to focus on your processing algorithm and Apache Spark code.</p>



<p>The service will download your Apache Spark code from one of your Object Storage containers, and ask you how much RAM and how many CPU cores you would like your job to use. You will also have to specify the region you want the processing to take place in and, last but not least, choose the Apache Spark version you want to use to run your code. The service will then launch your job within a few seconds, with the specified parameters, until your job completes. Nothing else to do on your part: no cluster creation, no cluster destruction. Just focus on your code.</p>



<p>Your local computer&#8217;s resources no longer limit the amount of data you can process. You can run any number of processing jobs in parallel, in any region and with any supported version of Spark. It&#8217;s also very fast and very easy.</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/07/EF2587E4-9E30-4C3B-A6E1-0EAF3FF8F8F6-1024x813.jpeg" alt="OVHcloud Data Processing" class="wp-image-18856" width="512" height="407" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/07/EF2587E4-9E30-4C3B-A6E1-0EAF3FF8F8F6-1024x813.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/EF2587E4-9E30-4C3B-A6E1-0EAF3FF8F8F6-300x238.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/EF2587E4-9E30-4C3B-A6E1-0EAF3FF8F8F6-768x610.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/07/EF2587E4-9E30-4C3B-A6E1-0EAF3FF8F8F6.jpeg 1227w" sizes="auto, (max-width: 512px) 100vw, 512px" /></figure></div>



<h3 class="wp-block-heading">How does it work?</h3>



<p>On your side, you just need to:&nbsp;</p>



<ol class="wp-block-list"><li>Create a container in OVHcloud Object Storage and upload your Apache Spark code and any other required files to this container. Be careful not to put your data in the same container, as the whole container will be downloaded by the service.</li><li>Define the processing engine (such as Apache Spark) and its version, as well as the geographical region and the amount of resources (CPU cores, RAM and number of worker nodes) you need. There are three different ways to do this (OVHcloud Control Panel, API or ODP CLI).</li></ol>



<p>These are the steps that happen when you run a processing job on the OVHcloud Data Processing (ODP) platform:</p>



<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow">
<ol class="wp-block-list"><li>ODP takes over and handles the deployment and execution of your job according to the specifications you defined.</li><li>Before starting your job, ODP downloads all the files you uploaded to the specified container.</li><li>Next, ODP runs your job in a dedicated environment, created specifically for it. Apart from a limitation on the available ports (list <a href="https://docs.ovh.com/gb/en/data-processing/capabilities/" data-wpel-link="exclude">available here</a>), your job can connect to any data source (databases, object storage, etc.) to read or write data, as long as it is reachable through the Internet.</li><li>When the job is complete, ODP stores the execution output logs in your Object Storage and then deletes the whole cluster immediately.</li><li>You are charged for the amount of resources you specified, and only for the duration of your job&#8217;s computation, on a per-minute basis.</li></ol>
</div></div>
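<p>The per-minute billing in step 5 is simple to reason about. Assuming the duration is rounded up to whole minutes (and using a made-up rate, not an actual ODP price), the cost of a job can be sketched as:</p>

```python
import math

# Hypothetical rate per core-minute, for illustration only.
RATE_PER_CORE_MINUTE = 0.001

def job_cost(cores: int, duration_seconds: float) -> float:
    """Bill the specified cores for the job's duration, per started minute."""
    minutes = math.ceil(duration_seconds / 60)
    return cores * minutes * RATE_PER_CORE_MINUTE

# A 16-core job running 4 minutes 30 seconds is billed as 5 minutes:
print(job_cost(16, 270))
```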



<h3 class="wp-block-heading">Different ways to submit a job</h3>



<p>There are three different ways to submit a processing job to ODP, depending on your requirements: the OVHcloud Manager, the OVHcloud API and the ODP CLI (Command Line Interface).&nbsp;</p>



<h4 class="wp-block-heading">1. OVHcloud Manager</h4>



<p>To submit a job with the OVHcloud Manager, go to OVHcloud.com and log in with your OVHcloud account (or create one if necessary). Then go to the “Public Cloud” page, select the “Data Processing” link in the left panel and submit a job by clicking on “Start a new job”.</p>



<p>Before submitting a job, you need to create a container in OVHcloud Object Storage by clicking on the “Object Storage” link in the left panel, and upload your Apache Spark code and any other required files.&nbsp;</p>



<h4 class="wp-block-heading">2. OVHcloud API&nbsp;</h4>



<p>You can submit a job to ODP using the OVHcloud API. For more information, see the OVHcloud API web page: <a href="https://api.ovh.com/" data-wpel-link="exclude">https://api.ovh.com/</a>. You can also automate job submission using the ODP API.&nbsp;</p>



<h4 class="wp-block-heading">3. ODP CLI (Command Line Interface)&nbsp;</h4>



<p>ODP has an open-source Command Line Interface that you can find on OVH&#8217;s public GitHub at <a href="https://github.com/ovh/data-processing-spark-submit" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://github.com/ovh/data-processing-spark-submit</a>. Using the CLI, you can upload your files and code and create your Apache Spark cluster with just one command.&nbsp;</p>



<h3 class="wp-block-heading">Some ODP benefits</h3>



<p>You could, of course, keep running your processing tasks on your local computer, create an Apache Spark cluster on your own premises or with any cloud provider and manage it yourself, or use similar services from competitors. But ODP has several benefits that are worth keeping in mind when deciding on a solution:</p>



<ul class="wp-block-list"><li><strong>No cluster management or configuration</strong> skills or experience are needed.</li><li><strong>Not limited by resources</strong>, and easy and fast (the only limit is your cloud account quota).&nbsp;</li><li><strong>Pay-as-you-go </strong>model with simple pricing and no hidden costs (per-minute billing).</li><li><strong>Per-job resource definition</strong> (no more resources lost compared to a shared cluster).&nbsp;</li><li><strong>Easy Apache Spark version management </strong>(you select the version for each job, and you can even run different jobs with different versions of Apache Spark at the same time).&nbsp;</li><li><strong>Region</strong> <strong>selection</strong> (you can select different regions based on your data locality or data privacy policy).</li><li><strong>Start a Data Processing job</strong> <strong>in just a few seconds</strong>.</li><li><strong>Real-time logs </strong>(while your job is running, you receive <strong>real-time logs</strong> in your Customer Panel).</li><li><strong>Full output logs available as soon as the job finishes </strong>(some competitors take minutes to deliver logs to you).</li><li><strong>Job submission automation</strong> (using the ODP API or CLI).</li><li><strong>Data privacy </strong>(OVHcloud is a European company and all customers are strictly protected by the European GDPR).</li></ul>



<h3 class="wp-block-heading">Conclusion</h3>



<p>With the advance of new technologies and devices, we are flooded with data. More and more, it is essential for businesses and for academic research to process data sets and understand where the value is. By providing the OVHcloud Data Processing (ODP) service, our goal is to provide you with one of the easiest and most efficient platforms to process your data. Just focus on your processing algorithm and ODP will handle the rest for you.</p>
<img loading="lazy" decoding="async" src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Ftry-the-new-ovhcloud-data-processing-service%2F&amp;action_name=Do%20you%20need%20to%20process%20your%20data%3F%20Try%20the%20new%20OVHcloud%20Data%20Processing%20service%21&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Contributing to Apache HBase: custom data balancing</title>
		<link>https://blog.ovhcloud.com/contributing-to-apache-hbase-custom-data-balancing/</link>
		
		<dc:creator><![CDATA[Pierre Zemb]]></dc:creator>
		<pubDate>Fri, 14 Feb 2020 16:37:19 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Databases]]></category>
		<category><![CDATA[Metrics]]></category>
		<category><![CDATA[Observability]]></category>
		<category><![CDATA[Open Source]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=16524</guid>

					<description><![CDATA[In today&#8217;s blogpost, we&#8217;re going to take a look at our upstream contribution to Apache HBase’s stochastic load balancer, based on our experience of running HBase clusters to support OVHcloud&#8217;s monitoring. The context Have you ever wondered how: we generate the graphs for your OVHcloud server or web hosting package? our internal teams monitor their [&#8230;]<img src="//blog.ovhcloud.com/wp-content/plugins/matomo/app/matomo.php?idsite=1&amp;rec=1&amp;url=https%3A%2F%2Fblog.ovhcloud.com%2Fcontributing-to-apache-hbase-custom-data-balancing%2F&amp;action_name=Contributing%20to%20Apache%20HBase%3A%20custom%20data%20balancing&amp;urlref=https%3A%2F%2Fblog.ovhcloud.com%2Ffeed%2F" style="border:0;width:0;height:0" width="0" height="0" alt="" />]]></description>
										<content:encoded><![CDATA[
<p>In today&#8217;s blogpost, we&#8217;re going to take a look at our upstream contribution to Apache HBase’s stochastic load balancer, based on our experience of running HBase clusters to support OVHcloud&#8217;s monitoring.</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/B043D804-9AF8-4109-8D73-3E36B9248282-1024x537.jpeg" alt="Contributing to Apache HBase: custom data balancing" class="wp-image-17086" width="1024" height="537" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/B043D804-9AF8-4109-8D73-3E36B9248282-1024x537.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/B043D804-9AF8-4109-8D73-3E36B9248282-300x157.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/B043D804-9AF8-4109-8D73-3E36B9248282-768x403.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/B043D804-9AF8-4109-8D73-3E36B9248282.jpeg 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<h3 class="wp-block-heading">The context</h3>



<p>Have you ever wondered how:</p>



<ul class="wp-block-list"><li>we generate the graphs for your OVHcloud server or web hosting package? </li><li>our internal teams monitor their own servers and applications?</li></ul>



<p><strong>All internal teams are constantly gathering telemetry and monitoring data</strong> and sending them to a <strong>dedicated team,</strong> who are responsible for <strong>handling all the metrics and logs generated by OVHcloud&#8217;s infrastructure</strong>: the Observability team.</p>



<p>We tried a lot of different <strong>Time Series databases</strong>, and eventually chose <a href="https://warp10.io" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Warp10</a> to handle our workloads. <strong>Warp10</strong> can be integrated with the various <strong>big-data solutions</strong> provided by the <a href="https://www.apache.org/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Apache Foundation.</a> In our case, we use <a href="http://hbase.apache.org" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Apache HBase</a> as the long-term storage datastore for our metrics. </p>



<p><a href="http://hbase.apache.org/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Apache HBase</a>, a datastore built on top of <a href="http://hadoop.apache.org/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">Apache Hadoop</a>, provides <strong>an elastic, distributed, key-ordered map.</strong> As such, one of the key features of Apache HBase for us is the ability to <strong>scan</strong>, i.e. retrieve a range of keys. Thanks to this feature, we can fetch <strong>thousands of datapoints in an optimised way</strong>.</p>



<p>We have our own dedicated clusters, the biggest of which has more than 270 nodes to spread our workloads:</p>



<ul class="wp-block-list"><li>between 1.6 and 2 million writes per second, 24/7</li><li>between 4 and 6 million reads per second</li><li>around 300TB of telemetry, stored within Apache HBase</li></ul>



<p>As you can probably imagine, storing 300TB of data in 270 nodes comes with some challenges regarding repartition, as <strong>every</strong> <strong>bit is hot data, and should be accessible at any time</strong>. Let&#8217;s dive in!</p>



<h3 class="wp-block-heading">How does balancing work in Apache HBase?</h3>



<p>Before diving into the balancer, let&#8217;s take a look at how HBase organises data. In Apache HBase, data is split into shards called <code>Regions</code>, which are distributed across <code>RegionServers</code>. The number of regions increases as data comes in, and regions are split as a result. This is where the <code>Balancer</code> comes in: it <strong>moves regions</strong> to avoid hotspotting a single <code>RegionServer</code> and effectively distribute the load.</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="768" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/C4812E1B-B58E-4CC9-BDAC-5F92AF68A5FA-1024x768.jpeg" alt="" class="wp-image-17007" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/C4812E1B-B58E-4CC9-BDAC-5F92AF68A5FA-1024x768.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/C4812E1B-B58E-4CC9-BDAC-5F92AF68A5FA-300x225.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/C4812E1B-B58E-4CC9-BDAC-5F92AF68A5FA-768x576.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/C4812E1B-B58E-4CC9-BDAC-5F92AF68A5FA-1536x1152.jpeg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/C4812E1B-B58E-4CC9-BDAC-5F92AF68A5FA.jpeg 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<p>The actual implementation, called <a href="https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/master/balancer/StochasticLoadBalancer.java" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">StochasticBalancer</a>, uses <strong>a cost-based approach:</strong></p>



<ol class="wp-block-list"><li>It first computes the <strong>overall cost</strong> of the cluster by looping through the <code>cost functions</code>. Every cost function <strong>returns a number between 0 and 1 inclusive</strong>, where 0 is the lowest-cost, best solution, and 1 is the highest-cost, worst solution. Apache HBase comes with several cost functions, which measure things like region load, table load, data locality, the number of regions per RegionServer&#8230; The computed costs are <strong>scaled by their respective coefficients, defined in the configuration</strong>. </li><li>Now that the initial cost is computed, we can try to <code>Mutate</code> our cluster. For this, the Balancer creates a random <code>nextAction</code>, which could be something like <strong>swapping two regions</strong> or <strong>moving one region to another RegionServer</strong>. The action is <strong>applied virtually</strong>, and then the <strong>new cost is calculated</strong>. If the new cost is lower than the previous one, the action is stored; if not, it is skipped. This operation is repeated <code>thousands of times</code>, hence the <code>Stochastic</code>. </li><li>At the end,<strong> the list of valid actions is applied to the actual cluster. </strong></li></ol>
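<p>Stripped of every HBase-specific detail, this loop is a randomised hill-climb over region assignments. The toy Python sketch below (not the actual <code>StochasticLoadBalancer</code> code) uses a single cost function analogous to <code>RegionCountSkewCostFunction</code>:</p>

```python
import random

def skew_cost(assignment, servers):
    """Cost in [0, 1]: 0 when regions are evenly spread across servers,
    1 when a single server holds every region."""
    counts = [list(assignment.values()).count(s) for s in servers]
    return (max(counts) - min(counts)) / len(assignment)

def balance(assignment, servers, steps=10_000, seed=42):
    rng = random.Random(seed)
    assignment = dict(assignment)
    cost = skew_cost(assignment, servers)
    regions = list(assignment)
    for _ in range(steps):
        # Random mutation: virtually move one region to a random server.
        region = rng.choice(regions)
        old = assignment[region]
        assignment[region] = rng.choice(servers)
        new_cost = skew_cost(assignment, servers)
        if new_cost < cost:
            cost = new_cost           # keep the improving move
        else:
            assignment[region] = old  # roll the mutation back
    return assignment, cost

# 12 regions initially piled onto one RegionServer, 3 servers available:
initial = {f"region-{i}": "rs1" for i in range(12)}
final, cost = balance(initial, ["rs1", "rs2", "rs3"])
print(cost)  # converges towards 0: four regions per server
```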



<h3 class="wp-block-heading">What was not working for us?</h3>



<p>We found out that <strong>for our specific use case</strong>, which involved:</p>



<ul class="wp-block-list"><li>Single table</li><li>Dedicated Apache HBase and Apache Hadoop, <strong>tailored for our requirements</strong></li><li>Good key distribution</li></ul>



<p>&#8230; <strong>the number of regions per RegionServer was the real limit for us</strong>.</p>



<p>Even if this balancing strategy seems simple, <strong>we do think that being able to run an Apache HBase cluster on heterogeneous hardware is vital</strong>, especially in cloud environments, because you <strong>may not be able to buy the same server specs again in the future.</strong> For example, our cluster grew from 80 to ~250 machines in four years. Throughout that time, we bought new dedicated server models, and even tested some special internal ones.</p>



<p>We ended up with different groups of hardware: <strong>some servers can handle only 180 regions, whereas the biggest can handle more than 900</strong>. Because of this disparity, we had to disable the Load Balancer, to avoid the <a href="https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/master/balancer/StochasticLoadBalancer.java#L1194" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">RegionCountSkewCostFunction</a> trying to bring all RegionServers to the same number of regions.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="768" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/8E561C8C-17E0-46F2-AF20-7BE8900427F6-1024x768.jpeg" alt="RegionCountSkewCostFunction balancing" class="wp-image-17010" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/8E561C8C-17E0-46F2-AF20-7BE8900427F6-1024x768.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/8E561C8C-17E0-46F2-AF20-7BE8900427F6-300x225.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/8E561C8C-17E0-46F2-AF20-7BE8900427F6-768x576.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/8E561C8C-17E0-46F2-AF20-7BE8900427F6-1536x1152.jpeg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/8E561C8C-17E0-46F2-AF20-7BE8900427F6.jpeg 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Two years ago, we developed some internal tooling responsible for load balancing regions across RegionServers. It worked really well for our use case, simplifying the day-to-day operation of our cluster.</p>



<p><strong>Open source is in OVHcloud&#8217;s DNA</strong>, which means that we build our tools on open source software, but also that we <strong>contribute</strong> and give back to the community. When we spoke with others, we saw that we weren&#8217;t the only ones affected by the heterogeneous cluster problem. We decided to rewrite our tooling to make it more general, and to <strong>contribute</strong> it <strong>directly upstream</strong> to the HBase project.</p>



<h3 class="wp-block-heading">Our contributions</h3>



<p>The first contribution was pretty simple: the cost function list was a <a href="https://github.com/apache/hbase/blob/8cb531f207b9f9f51ab1509655ae59701b66ac37/hbase-server/src/main/java/org/apache/hadoop/hbase/master/balancer/StochasticLoadBalancer.java#L199-L213" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">constant</a>. We <a href="https://github.com/apache/hbase/commit/836f26976e1ad8b35d778c563067ed0614c026e9" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">added the possibility to load custom cost functions</a>.</p>



<p>The second contribution was about <a href="https://github.com/apache/hbase/commit/42d535a57a75b58f585b48df9af9c966e6c7e46a" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">adding an optional costFunction to balance regions according to a capacity rule</a>.</p>



<h3 class="wp-block-heading">How does it work?</h3>



<p>The balancer will load a file containing rules, one per line. <strong>A rule is composed of a regexp matching hostnames, and a limit.</strong> For example, we could have:</p>



<pre class="wp-block-code"><code class="">rs[0-9] 200
rs1[0-9] 50</code></pre>



<p>RegionServers with <strong>hostnames matching the first rule will have a limit of 200</strong>, and <strong>the others 50</strong>. If there&#8217;s no match, a default is set.</p>
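<p>A short Python sketch of this rule lookup (illustrative names only; the real logic lives in the Java cost function, and we assume here that a rule must match the whole hostname):</p>

```python
import re

def parse_rules(lines):
    """Parse 'regexp limit' rule lines into (compiled regexp, limit) pairs."""
    rules = []
    for line in lines:
        pattern, limit = line.split()
        rules.append((re.compile(pattern), int(limit)))
    return rules

def limit_for(hostname, rules, default=200):
    """Return the region limit of the first rule matching the full hostname."""
    for pattern, limit in rules:
        if pattern.fullmatch(hostname):
            return limit
    return default  # no match: fall back to a default limit
```

<p>With the two rules above, <code>rs3</code> gets a limit of 200 and <code>rs12</code> a limit of 50.</p>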



<p>Thanks to these rules, we have two key pieces of information:</p>



<ul class="wp-block-list"><li>the<strong> max number of regions for this cluster</strong></li><li>the<strong> rule for each server</strong></li></ul>



<p>The <code>HeterogeneousRegionCountCostFunction</code> will try to <strong>balance regions according to each server&#8217;s capacity.</strong></p>



<p>Let&#8217;s take an example&#8230; Imagine that we have 20 RS:</p>



<ul class="wp-block-list"><li>10 RS, named <code>rs0</code> to <code>rs9</code>, loaded with 60 regions each, which can each handle 200 regions.</li><li>10 RS, named <code>rs10</code> to <code>rs19</code>, loaded with 60 regions each, which can each handle 50 regions.</li></ul>



<p>So, based on the following rules:</p>



<pre class="wp-block-code"><code class="">rs[0-9] 200
rs1[0-9] 50</code></pre>



<p>&#8230; we can see that the <strong>second group is overloaded</strong>, whereas the first group has plenty of space.</p>



<p>We know that we can handle a maximum of <strong>2,500 regions</strong> (200&#215;10 + 50&#215;10), and we currently have <strong>1,200 regions</strong> (60&#215;20). The <code>HeterogeneousRegionCountCostFunction</code> will therefore see that the cluster is <strong>48.0% full</strong> (1200/2500), and will <strong>try to bring every RegionServer to ~48% of its own capacity, according to the rules.</strong></p>
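<p>That computation can be written down directly as a short Python sketch (the helper names are invented for illustration; this is not the HBase source):</p>

```python
def cluster_usage(regions_per_server, limit_per_server):
    """Overall fill ratio: total regions divided by total capacity."""
    total_regions = sum(regions_per_server.values())
    total_capacity = sum(limit_per_server.values())
    return total_regions / total_capacity

def target_regions(limit_per_server, usage):
    """Per-server target: each server should hold ~usage x its own limit."""
    return {rs: round(limit * usage) for rs, limit in limit_per_server.items()}
```

<p>For the 20-server example, usage is 1200/2500 = 0.48, so the big servers are targeted at ~96 regions each and the small ones at ~24.</p>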



<div class="wp-block-image"><figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="1024" height="768" src="https://www.ovh.com/blog/wp-content/uploads/2020/02/EE0CAA91-7767-4991-8710-1B0E993E945A-1024x768.jpeg" alt="HeterogeneousRegionCountCostFunction balancing" class="wp-image-17084" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/02/EE0CAA91-7767-4991-8710-1B0E993E945A-1024x768.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/EE0CAA91-7767-4991-8710-1B0E993E945A-300x225.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/EE0CAA91-7767-4991-8710-1B0E993E945A-768x576.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/EE0CAA91-7767-4991-8710-1B0E993E945A-1536x1152.jpeg 1536w, https://blog.ovhcloud.com/wp-content/uploads/2020/02/EE0CAA91-7767-4991-8710-1B0E993E945A.jpeg 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<h3 class="wp-block-heading">Where to next?</h3>



<p>Thanks to Apache HBase&#8217;s contributors, our patches are now <strong>merged</strong> into the master branch. As soon as Apache HBase maintainers publish a new release, we will deploy and use it at scale. This <strong>will allow more automation on our side, and ease operations for the Observability Team.</strong></p>



<p>Contributing was an awesome journey. What I love most about open source is the opportunity to contribute back and build stronger software. We <strong>had an opinion</strong> about how this particular issue should be addressed, but <strong>the discussions with the community helped us to refine it</strong>. We spoke with <strong>engineers from other companies, who were struggling with Apache HBase&#8217;s cloud deployments, just as we were</strong>, and thanks to those exchanges, <strong>our contribution became more and more relevant.</strong></p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How to run massive data operations faster than ever, powered by Apache Spark and OVH Analytics Data Compute</title>
		<link>https://blog.ovhcloud.com/how-to-run-massive-data-operations-faster-than-ever-powered-by-apache-spark-and-ovh-analytics-data-compute/</link>
		
		<dc:creator><![CDATA[Mojtaba Imani]]></dc:creator>
		<pubDate>Mon, 27 May 2019 11:48:26 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[apache]]></category>
		<category><![CDATA[Apache Spark]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[Spark]]></category>
		<guid isPermaLink="false">https://blog.ovh.com/fr/blog/?p=15512</guid>

					<description><![CDATA[If you&#8217;re reading this blog for the first time, welcome to the ongoing data revolution! Just after the industrial revolution came what we call the digital revolution, with millions of people and objects accessing a world wide network – the internet – all of them creating new content, new data. Let&#8217;s think about ourselves&#8230; We [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>If you&#8217;re reading this blog for the first time, welcome to the ongoing data revolution! Just after the industrial revolution came what we call the digital revolution, with millions of people and objects accessing a world wide network – the internet – all of them creating new content, new data.</p>



<p>Let&#8217;s think about ourselves&#8230; We now have smartphones taking pictures and sending texts, sports watches collecting data about our health, Twitter and Instagram accounts generating content, and many other use cases. As a result, data in all its forms is exponentially exploding all over the world.</p>



<p><strong>90% of the total data in the world was generated during the last two years.</strong> According to IDC, the amount of data in the world is set to grow from 33 zettabytes in 2018 to 175 zettabytes in 2025. A basic division gives approximately 34TB of data per person, across all countries and demographics.</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="960" height="461" src="/blog/wp-content/uploads/2019/05/568E870E-7DBF-4918-A367-FD5F4ED67FA5.png" alt="Annual size of the global datasphere" class="wp-image-15518" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/05/568E870E-7DBF-4918-A367-FD5F4ED67FA5.png 960w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/568E870E-7DBF-4918-A367-FD5F4ED67FA5-300x144.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/568E870E-7DBF-4918-A367-FD5F4ED67FA5-768x369.png 768w" sizes="auto, (max-width: 960px) 100vw, 960px" /></figure></div>



<p>Impressive, isn&#8217;t it?</p>



<p>This opens up a lot of new concepts and usages, but also, of course, new challenges. How do we store this data? How do we keep it secure and private? And last but not least, how do we get value from this data? This new giant datasphere needs to be processed; in other words, it needs to be used to extract value.</p>



<p>Potential results and applications are infinite: improving agriculture by analysing weather forecasts, understanding customers more deeply, researching new vaccines, redefining urban environments by analysing traffic jams&#8230; The list goes on.</p>



<p>It seems easy at first, but it requires three main elements:</p>



<ol class="wp-block-list"><li>First, we need <strong>data</strong>. Sometimes these data sources can be heterogeneous (text, audio, video, pictures etc.), and we may need to &#8220;clean&#8221; them before they can be used efficiently.</li><li>Next, we need <strong>compute power</strong>. Think about ourselves again: our brains can perform a lot of calculations and operations, but it&#8217;s impossible to split one task between multiple brains. Ask a friend to do a multiplication with you, and you&#8217;ll see this for yourself. With computers though, anything is possible! We are now able to parallelise calculations across multiple computers (i.e. a cluster), allowing us to get the results we want faster than ever.</li><li>Last, we need a <strong>framework</strong>: a set of tools that lets you use this datalake and compute power efficiently.</li></ol>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="1050" height="499" src="/blog/wp-content/uploads/2019/05/9B2BFE2E-E888-4696-8ADA-83FD9DE4B732.png" alt="Apache Spark &amp; OVH Analytics Data Compute" class="wp-image-15531" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/05/9B2BFE2E-E888-4696-8ADA-83FD9DE4B732.png 1050w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/9B2BFE2E-E888-4696-8ADA-83FD9DE4B732-300x143.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/9B2BFE2E-E888-4696-8ADA-83FD9DE4B732-768x365.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/9B2BFE2E-E888-4696-8ADA-83FD9DE4B732-1024x487.png 1024w" sizes="auto, (max-width: 1050px) 100vw, 1050px" /></figure></div>



<p>How do we build this? Let&#8217;s find out together!</p>



<h3 class="wp-block-heading">Step 1: Find the right framework</h3>



<p>As you&#8217;ll have seen from the title of this post, it&#8217;s not a secret that&nbsp;<strong>Apache Spark</strong> is our preferred tool at OVH.</p>



<p>We chose Apache Spark because it is an open-source, distributed, general-purpose cluster-computing framework with the largest open-source community in the world of big data. It can be up to 100 times faster than the previous cluster-computing framework, Hadoop MapReduce, thanks to features like in-memory processing and lazy evaluation. Apache Spark is the leading platform for large-scale SQL, batch processing, stream processing and machine learning, with an easy-to-use API, and for coding in Spark, you can use several programming languages, including&nbsp;<strong>Java, Scala, Python, R and SQL</strong>.</p>



<p>Other tools, like Apache Flink and Beam, look very promising as well, and will be part of our upcoming services.</p>



<p>The different components of Apache Spark are:</p>



<ul class="wp-block-list"><li><strong>Apache Spark Core</strong>, which provides in-memory computing, and forms the basis of the other components</li><li><strong>Spark SQL</strong>, which provides structured and semi-structured data abstraction</li><li><strong>Spark Streaming</strong>, which performs streaming analysis using RDD (Resilient Distributed Datasets) transformation</li><li><strong>MLlib</strong> <strong>(Machine Learning Library)</strong>, which is a distributed machine learning framework on top of Spark</li><li><strong>GraphX</strong>, which is a distributed graph processing framework on top of Spark</li></ul>



<h4 class="wp-block-heading">The Apache Spark architecture principle</h4>



<p>Before going further, let&#8217;s take the time to understand how Apache Spark can be so fast by reviewing its workflow.</p>



<p>Here is a sample code in Python, where we will read a file and count the number of lines with the letter &#8216;a&#8217;, and the number of lines with the letter &#8216;b&#8217;.</p>



<pre class="wp-block-code language-python"><code>from pyspark import SparkContext
 
logFile = "YOUR_SPARK_HOME/README.md"  # Should be some file on your system
sc = SparkContext("local", "Simple App")
logData = sc.textFile(logFile).cache()
 
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()
 
print("Lines with a: %i, lines with b: %i" % (numAs, numBs))
 
sc.stop()</code></pre>



<p>This code is part of your Spark Application, also known as your Driver Program.</p>



<p>Each action (<code>count()</code>, in our example) will trigger jobs. Apache Spark will then split your work into multiple tasks that can be computed separately.</p>



<p>Apache Spark stores data in RDDs (Resilient Distributed Datasets): immutable, distributed collections of objects. Each RDD is divided into logical partitions, so each part can be processed in parallel on different nodes of the cluster.</p>
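<p>To illustrate the idea outside of Spark, here is a toy Python sketch that splits lines into partitions, counts each partition independently, and then reduces the partial counts into one total (threads stand in for executor tasks; the names are invented):</p>

```python
from concurrent.futures import ThreadPoolExecutor

def partition(data, n):
    """Split data into at most n roughly equal logical partitions."""
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]

def count_lines_with(letter, lines, n_partitions=4):
    """Count lines containing `letter`, one partition at a time, in parallel."""
    parts = partition(lines, n_partitions)
    with ThreadPoolExecutor() as pool:
        # Map step: each partition is counted independently...
        partials = pool.map(lambda part: sum(letter in s for s in part), parts)
    # ...reduce step: partial counts are combined into the final result.
    return sum(partials)
```

<p>In a real cluster, each partition would live on a different node, which is exactly what makes the work parallelisable.</p>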



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/05/B31D0624-AA68-4A9D-B74B-73120149B78D-1024x530.png" alt="" class="wp-image-15530" width="768" height="398" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/05/B31D0624-AA68-4A9D-B74B-73120149B78D-1024x530.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/B31D0624-AA68-4A9D-B74B-73120149B78D-300x155.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/B31D0624-AA68-4A9D-B74B-73120149B78D-768x398.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/B31D0624-AA68-4A9D-B74B-73120149B78D-1200x621.png 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/B31D0624-AA68-4A9D-B74B-73120149B78D.png 2048w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure></div>



<p><strong>Task&nbsp;</strong><b>parallelism</b>&nbsp;and <strong>in-memory computing</strong> are the key to being ultra-fast here. You can go deeper in the <a href="https://spark.apache.org/docs/latest/cluster-overview.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">official documentation</a>.</p>



<h3 class="wp-block-heading">Step 2: Find the required compute power</h3>



<p>We now have the tools, but they need compute power (we are mainly talking about CPU and RAM memory) to perform such massive operations, and this has to be scalable.</p>



<p>Let&#8217;s talk about creating a cluster of computers. The old-fashioned way is to buy physical computers and the network equipment to connect them together, install the OS and all required software and packages, install Apache Spark on all the nodes, then configure Spark&#8217;s standalone cluster management system and connect all workers to the master node.</p>



<p>Obviously, this isn&#8217;t the best way. It takes a lot of time and requires skilled engineers to handle everything. Also, assume that you did this difficult job and then finished your big data processing&#8230; What are you going to do with the cluster after that? Just leave it there, or sell it on the second-hand market? What if you decided to perform some larger-scale processing and needed to add more computers to your cluster? You&#8217;d need to do all the software and network installations and configuration for the new nodes.</p>



<p>A better way of creating a cluster is <strong>to use a Public Cloud provider</strong>. This way, you will have your servers deployed very quickly, only pay for what you consume, and can delete the cluster after finishing your processing task. You&#8217;ll also be able to access your data much more easily than you would with an on-premises solution. It&#8217;s not a coincidence that, according to IDC,&nbsp;<strong>half of the total data in the world will be stored in the public cloud by 2025</strong> [3].</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="960" height="554" src="/blog/wp-content/uploads/2019/05/AF0EE15E-82CF-490D-B872-0AEE36CDC3F1.png" alt="Where is the data stored?" class="wp-image-15519" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/05/AF0EE15E-82CF-490D-B872-0AEE36CDC3F1.png 960w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/AF0EE15E-82CF-490D-B872-0AEE36CDC3F1-300x173.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/AF0EE15E-82CF-490D-B872-0AEE36CDC3F1-768x443.png 768w" sizes="auto, (max-width: 960px) 100vw, 960px" /></figure></div>



<p>But the main problem persists: you still need to install all the software and packages on each of the servers in your virtual cluster, then configure the network and routers, take security measures and configure the firewall, and finally, install and configure the Apache Spark cluster management system. It will take time and be prone to errors, and the longer it takes, the more you will be charged due to having those servers deployed in your cloud account.</p>



<h3 class="wp-block-heading">Step 3: Take a rest, and discover OVH Analytics Data Compute</h3>



<p>As we&#8217;ve just seen, building a cluster can be done manually, but it&#8217;s a boring and time-consuming task.</p>



<p>At OVH, we solved this problem by introducing a cluster-computing service called <strong>Analytics Data Compute</strong>, which will create a 100% ready, fully installed and configured Apache Spark cluster on the fly. By using this service, you don&#8217;t need to waste your time on server creation, network, firewalls and security configurations on each node of your cluster. You just focus on your tasks, and the compute cluster you need will appear as if by magic!</p>



<p>In fact, there&#8217;s nothing really magic about it&#8230; just automation built by OVH to simplify both our lives and yours. We needed this kind of tool internally for large computations, and then crafted it into a product for you.</p>



<p>The concept is quite simple: you launch an Apache Spark job as normal through the command line or API, and a full Apache Spark cluster will be built on the fly, just for your job. Once the processing is done, we delete the cluster and you&#8217;re invoiced for the exact resources that were used (on an hourly basis, for now).</p>



<p>This way, we are able to rapidly scale from one to thousands of virtual machines, allowing you to use thousands of CPU cores and thousands of GB of RAM.</p>



<div class="wp-block-image"><figure class="aligncenter is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/05/CD83F4F3-4858-414A-A50E-362477780385-1024x662.png" alt="" class="wp-image-15537" width="768" height="497" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/05/CD83F4F3-4858-414A-A50E-362477780385-1024x662.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/CD83F4F3-4858-414A-A50E-362477780385-300x194.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/CD83F4F3-4858-414A-A50E-362477780385-768x496.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/CD83F4F3-4858-414A-A50E-362477780385-1200x775.png 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/CD83F4F3-4858-414A-A50E-362477780385.png 2048w" sizes="auto, (max-width: 768px) 100vw, 768px" /></figure></div>



<p>To use Analytics Data Compute, you need to download a small, open-source client software package from the OVH repository, called <a href="https://repository.dataconvergence.ovh.com/repository/binary/ovh-spark-submit/mac/ovh-spark-submit" data-wpel-link="exclude">ovh-spark-submit</a>.</p>



<p>This client was designed to keep the official spark-submit command-line syntax of Apache Spark. Most of the options and syntax are the same, although the OVH version has some additional options related to infrastructure and cluster management. This way, you simply request to run your code over your data on a cluster with a specific number of nodes, and the tool will create that cluster, install all packages and software (including Spark and its cluster management system), and then configure the network and firewall. After creating the cluster, OVH Analytics Data Compute will run your Spark code over it, return the result to the user, and then delete the whole thing once it&#8217;s done. Much more efficient!</p>



<h3 class="wp-block-heading">Let&#8217;s get it started&#8230; Feel the power!</h3>



<p>The good news is that if you are already familiar with the spark-submit command line of Apache Spark, you don&#8217;t need to learn any new command-line tools, as ovh-spark-submit uses almost exactly the same options and commands.</p>



<p>Let&#8217;s look at an example, where we&#8217;ll estimate the famous number Pi, first with the original Apache Spark syntax, and then with the ovh-spark-submit client:</p>



<pre class="wp-block-code language-bash"><code>./spark-submit \
	--class org.apache.spark.examples.SparkPi \
	--total-executor-cores 20 \
	SparkPI.jar 100

./ovh-spark-submit \
	--class org.apache.spark.examples.SparkPi \
	--total-executor-cores 20 \
	SparkPI.jar 100</code></pre>



<p>You can see that the only difference is &#8220;ovh-&#8221; at the beginning of the command line, while the rest is the same. And by running the <code>ovh-spark-submit</code> command, you will run the job over a cluster of computers with 20 cores instead of just your local computer. This cluster is fully dedicated to this job, as it will be created after running the command, then deleted once it&#8217;s finished.</p>



<p>Another example is the popular word-count use case. Let&#8217;s assume you want to calculate the number of words in a big text file, using a cluster of 100 cores. The big text file is stored in OpenStack Swift storage (although it could be any online or cloud storage system). The Spark code for this calculation in Java looks like this:</p>



<pre class="wp-block-code language-java"><code>JavaRDD&lt;String> lines = spark.read().textFile("swift://textfile.abc/novel.txt").javaRDD();

JavaRDD&lt;String> words = lines.flatMap(s -> Arrays.asList(SPACE.split(s)).iterator());
JavaPairRDD&lt;String, Integer> ones = words.mapToPair(s -> new Tuple2&lt;>(s, 1));
JavaPairRDD&lt;String, Integer> counts = ones.reduceByKey((i1, i2) -> i1 + i2);
List&lt;Tuple2&lt;String, Integer>> output = counts.collect();</code></pre>



<p>We can select the desired version of Spark as well. For this example, we&#8217;ve selected Spark version 2.4.0, and the command line for running this Spark job looks like this:</p>



<pre class="wp-block-code language-bash"><code>./ovh-spark-submit \
	--class JavaWordCount \
	--total-executor-cores 100 \
	--name wordcount1 \
	--version 2.4.0 \
	SparkWordCount-fat.jar </code></pre>



<p>To create our Spark cluster, we use nodes that have four vCores and 15GB of RAM. Therefore, by running this command, a cluster of 26 servers will be created (one for the master node and 25 for workers), so we will have 25&#215;4=100 vCores and 25&#215;15=375GB of RAM.</p>
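<p>The sizing follows directly from the requested executor cores. Here is a small illustrative Python helper (not part of ovh-spark-submit; the node specs are the ones from this example):</p>

```python
import math

def cluster_size(total_executor_cores, vcores_per_node=4, ram_per_node_gb=15):
    """One master node plus enough identical workers to cover the requested cores."""
    workers = math.ceil(total_executor_cores / vcores_per_node)
    return {
        "servers": workers + 1,  # +1 for the master node
        "workers": workers,
        "vcores": workers * vcores_per_node,
        "ram_gb": workers * ram_per_node_gb,
    }
```

<p>For <code>--total-executor-cores 100</code>, this gives 26 servers: 25 workers providing 100 vCores and 375GB of RAM, plus the master.</p>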



<p>After running the command line, you will see the progress of creating the cluster and installing all the required software.</p>



<p>Once the cluster is created, you can take a look at it with the official Spark dashboard, and check if your cluster has all 25 workers up and running:</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="813" height="884" src="/blog/wp-content/uploads/2019/05/3BBC4753-797C-4E99-8DD6-F6E24E4A9B0E.png" alt="" class="wp-image-15522" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/05/3BBC4753-797C-4E99-8DD6-F6E24E4A9B0E.png 813w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/3BBC4753-797C-4E99-8DD6-F6E24E4A9B0E-276x300.png 276w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/3BBC4753-797C-4E99-8DD6-F6E24E4A9B0E-768x835.png 768w" sizes="auto, (max-width: 813px) 100vw, 813px" /></figure></div>



<p>Also, if you go to the OpenStack Horizon dashboard in your OVH cloud account, you will see all 26 servers:</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="1024" height="644" src="https://www.ovh.com/blog/wp-content/uploads/2019/05/CD92F627-35E2-42FC-B1F9-740FCC19BA1D-1024x644.png" alt="" class="wp-image-15523" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/05/CD92F627-35E2-42FC-B1F9-740FCC19BA1D-1024x644.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/CD92F627-35E2-42FC-B1F9-740FCC19BA1D-300x189.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/CD92F627-35E2-42FC-B1F9-740FCC19BA1D-768x483.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/CD92F627-35E2-42FC-B1F9-740FCC19BA1D-1200x755.png 1200w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/CD92F627-35E2-42FC-B1F9-740FCC19BA1D.png 1311w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<p>The Apache Spark job will be executed according to the Java code in the jar file that we sent to the Spark cluster, and the results will be shown on the screen. The results and the complete log files will also be saved on both the local computer and in the user&#8217;s Swift storage.</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="212" height="372" src="/blog/wp-content/uploads/2019/05/336DF00D-6802-419F-90AB-E0C2E81B6A8A.png" alt="" class="wp-image-15524" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/05/336DF00D-6802-419F-90AB-E0C2E81B6A8A.png 212w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/336DF00D-6802-419F-90AB-E0C2E81B6A8A-171x300.png 171w" sizes="auto, (max-width: 212px) 100vw, 212px" /></figure></div>



<p>Once you&#8217;re done, you will see a message confirming that the cluster has been deleted, along with the addresses of the logs in OpenStack Swift storage and on the local computer. You can see in the following screenshot that creating a fully installed and configured Spark cluster with 26 servers took less than five minutes.</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="1024" height="473" src="https://www.ovh.com/blog/wp-content/uploads/2019/05/C4313AB6-1E72-468F-A48E-60B71D75873F-1024x473.png" alt="" class="wp-image-15525" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/05/C4313AB6-1E72-468F-A48E-60B71D75873F-1024x473.png 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/C4313AB6-1E72-468F-A48E-60B71D75873F-300x138.png 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/C4313AB6-1E72-468F-A48E-60B71D75873F-768x354.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/05/C4313AB6-1E72-468F-A48E-60B71D75873F-1200x554.png 1200w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<h3 class="wp-block-heading">A bit more about OVH Analytics Data Compute</h3>



<p>If you are curious, here are some additional details about Analytics Data Compute:</p>



<ul class="wp-block-list"><li>Everything is built on the OVH Public Cloud, which means everything is powered by OpenStack.</li><li>You can choose the Apache Spark version you want to run, directly in the command line. You can also, of course, run multiple clusters with different versions.</li><li>A new dedicated cluster will be created for each request, and will be deleted after finishing the job. This means there are no security or privacy issues caused by having multiple users for a single cluster.</li><li>You have the option of keeping your cluster after finishing the job. If you add the <em>keep-infra</em> option to the command line, the cluster will not be deleted when you&#8217;re done. You can then send more jobs to that cluster or view more details from the logs.</li><li>Your cluster computers are created in your own OVH Public Cloud project, so you have full control of your cluster computers.</li><li>Results and output logs will be saved in Swift on your OVH Public Cloud project. Only you will have access to them, and you will also have the full history of all your Spark jobs saved in a folder, organised by date and time of execution.</li><li>Input and output of data can be any source or format. There is no vendor lock-in when it comes to storage, so you are not forced to only use OVH cloud storage to store your data, and can use any online or cloud storage platform on the public internet.</li><li>You can access your Cluster and Spark dashboards and web UIs via HTTPS.</li></ul>



<h4 class="wp-block-heading">Let&#8217;s focus on cluster management systems</h4>



<p>In Apache Spark clusters, there are independent processes on all cluster nodes called &#8220;executors&#8221;, which are coordinated by the driver program. To allocate cluster resources across applications, the driver program connects to a cluster management system, after which it sends application code and tasks to the executors.</p>



<p>There are several options when it comes to cluster management systems, but to keep things fast and simple, we selected the Spark standalone cluster management system. This offers our users the freedom to choose any version of Spark, and also makes cluster installation faster than the other options. If, for example, we had selected Kubernetes as our cluster management system, our users would have been limited to Spark versions 2.3 or above, and cluster installation would have been more time-consuming. Alternatively, if we wanted to deploy a ready-to-use Kubernetes cluster (like OVH Managed Kubernetes), then we would have lost our scalability, because the infrastructure of our Apache Spark cluster would have been inherently limited by the infrastructure of the Kubernetes cluster. But with our current design, users can have an Apache Spark cluster with as many servers as they like, and the freedom to scale easily.</p>



<h3 class="wp-block-heading">Try it yourself!</h3>



<p>To get started with Analytics Data Compute, you just need to create a cloud account at www.ovh.com, then download the ovh-spark-submit software, and run it as described in <a href="https://docs.ovh.com/gb/en/analytics-data-compute/labs/data-compute/getting-started-with-analytics-data-compute/" data-wpel-link="exclude">the OVH documentation page</a>. Also, if you participate in <a href="https://labs.ovh.com/analytics-data-compute" data-wpel-link="exclude">a short survey on our OVH Labs page</a>, you will receive a voucher, which will let you test Analytics Data Compute first-hand, with 20 euros of free credit.</p>
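<p>As a rough illustration, a job submission might look like the following. The flag spellings here are assumptions (except <em>keep-infra</em>, which is named above); refer to the documentation page for the real syntax.</p>

```bash
# Hypothetical invocation: flag names are assumptions, except keep-infra,
# which the post mentions; see the OVH documentation for the real syntax.
# --spark-version : the Spark version is chosen directly on the command line
# --keep-infra    : keep the cluster alive after the job finishes
ovh-spark-submit --spark-version 2.4.0 --keep-infra \
  --class org.apache.spark.examples.SparkPi spark-examples.jar 1000
```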



<p>If you have any questions or would like further explanation, our team is available through our Gitter channel.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Understanding CI/CD for Big Data and Machine Learning</title>
		<link>https://blog.ovhcloud.com/understanding-ci-cd-for-big-data-and-machine-learning/</link>
		
		<dc:creator><![CDATA[Yvonnick Esnault]]></dc:creator>
		<pubDate>Thu, 14 Feb 2019 12:28:36 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Automation]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[CDS]]></category>
		<category><![CDATA[DataBuzzWord]]></category>
		<category><![CDATA[DevOps]]></category>
		<category><![CDATA[Docker]]></category>
		<category><![CDATA[Kubernetes]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[OpenStack]]></category>
		<category><![CDATA[Podcast]]></category>
		<guid isPermaLink="false">https://blog.ovh.com/fr/blog/?p=14588</guid>

					<description><![CDATA[This week, the OVH Integration and Continuous Deployment team was invited to the&#160;DataBuzzWord&#160;podcast. Together, we explored the topic of continuous deployment in the context of machine learning and big data.&#160;We also discussed continuous deployment for environments like&#160;Kubernetes, Docker, OpenStack and&#160;VMware VSphere. If you missed it, or would like to review everything that was discussed, you [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>This week, the OVH Integration and Continuous Deployment team was invited to the&nbsp;<a href="https://www.spreaker.com/show/2072727" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">DataBuzzWord</a>&nbsp;podcast.</p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="297" height="300" src="/blog/wp-content/uploads/2019/02/CE25CF9F-9489-4B9D-B123-FE4FD613EF85-297x300.png" alt="CDS" class="wp-image-14512" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/02/CE25CF9F-9489-4B9D-B123-FE4FD613EF85-297x300.png 297w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/CE25CF9F-9489-4B9D-B123-FE4FD613EF85-768x775.png 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/02/CE25CF9F-9489-4B9D-B123-FE4FD613EF85.png 793w" sizes="auto, (max-width: 297px) 100vw, 297px" /></figure></div>



<p>Together, we explored the topic of continuous deployment in the context of machine learning and big data.&nbsp;We also discussed continuous deployment for environments like&nbsp;<a href="https://www.ovh.com/fr/blog/?s=kubernetes" data-wpel-link="exclude">Kubernetes</a>, Docker, OpenStack and&nbsp;<a href="https://www.ovh.com/fr/blog/?s=vmware" data-wpel-link="exclude">VMware VSphere</a>.</p>



<p>If you missed it, or would like to review everything that was discussed, you can&nbsp;<a href="https://www.spreaker.com/show/2072727" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">listen to it again here</a>. We hope to return soon, to continue sharing our passion for testing, integration and continuous deployment.</p>



<p>Although the podcast was recorded in French, starting from tomorrow, we&#8217;ll be delving further into the key points of our discussion in a series of articles on this blog.</p>


<div class="lazyblock-youtube-gdpr-compliant-ZRXyrv wp-block-lazyblock-youtube-gdpr-compliant"><script type="module">
  import 'https://blog.ovhcloud.com/wp-content/assets/ovhcloud-gdrp-compliant-embedding-widgets/src/ovhcloud-gdrp-compliant-spreaker.js';
</script>
      
      <ovhcloud-gdrp-compliant-spreaker
          spreaker="17021384"
          debug></ovhcloud-gdrp-compliant-spreaker> 

</div>


<p>Find CDS on GitHub:</p>



<ul class="wp-block-list"><li><a href="https://github.com/ovh/cds" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://github.com/ovh/cds</a></li></ul>



<p>&#8230; and follow us on Twitter:</p>



<ul class="wp-block-list"><li><a href="https://twitter.com/yesnault" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">https://twitter.com/yesnault</a></li><li><a href="https://twitter.com/francoissamin" rel="nofollow external noopener noreferrer" data-wpel-link="external" target="_blank">https://twitter.com/francoissamin</a></li></ul>



<p>Come chat about these subjects with us on our Gitter channel:&nbsp;<a href="https://gitter.im/ovh-cds/" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">https://gitter.im/ovh-cds/</a></p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
