migrate-datacentre --quiet: How do we seamlessly migrate a datacentre?

In our previous articles, we explained why we had to move 800,000 databases from one datacentre to another, 300 kilometres away. So here we are… We did it!
Now we will focus on a very important goal of any migration. From your point of view, as a customer, you must see… Nothing! I understand this completely, as I am also an OVH customer. And as a customer paying for a service, I want it to work, regardless of what the people behind the scenes do.


In this post, we will see how we managed to move all of them (almost) seamlessly…

The basics

How does a website connect to a database that’s hosted on a different server?
Short answer: by using the network.
As a diagram is always better than a long answer, here is a simplified view of how it works…

The webserver checks the website’s configuration to determine where the database is located:

$database_server = 'mydatabase.mysql.db';
$database_name = 'mydatabase';
$database_user = 'mydatabase';
$database_password = 'correct horse battery staple';
# ref: https://www.xkcd.com/936/

Then it queries the Domain Name System (DNS) to obtain the IP address behind mydatabase.mysql.db. Once the IP is known, the webserver can talk to the database server.

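To make this concrete, here is a minimal sketch, in the same PHP style as the configuration above, of what the webserver does when it opens a connection. The IP address in the comment is made up for the example.

# Minimal sketch: resolve the database host name, then connect to it.
$database_server = 'mydatabase.mysql.db';

# 1. DNS lookup: translate the name into an IP address (the value shown is made up).
$ip = gethostbyname($database_server);   # e.g. '10.3.7.42'

# 2. Open the MySQL connection to that IP.
$link = mysqli_connect($ip, 'mydatabase', 'correct horse battery staple', 'mydatabase');

# In practice you simply pass the name and the resolver does step 1 for you:
# $link = mysqli_connect($database_server, 'mydatabase', 'correct horse battery staple', 'mydatabase');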

But having one network name for every database is something relatively new in OVH’s history. It was first introduced five years ago, with the Gravelines architecture described here.

A 20-year legacy

In the beginning, we used to give our customers the IP address of the database server.

On the positive side, you avoided querying the DNS server, you saved time, you avoided a possible bottleneck, and you may have even avoided a potential point of failure.
On the negative side, if we had to change the server and its IP couldn’t be moved – let’s say, when a migration occurred – every customer had to manually change their website’s configuration.

$database_server = '10.0.59.42';
$database_name = 'mydatabase';
$database_user = 'mydatabase';
$database_password = 'correct horse battery staple';
# ref: https://www.xkcd.com/936/

As we have loyal customers (including many of you, in this case!), giving out the IP directly was not a bad idea, but it was not really scalable.

Giving the DB server IP directly to the webserver

So, to ease server management, hardware end-of-life handling and so on, OVHcloud switched to providing our customers with the server’s name instead of the server’s IP.

$database_server = 'mysql55-59.pro';
$database_name = 'mydatabase';
$database_user = 'mydatabase';
$database_password = 'Brussels Sprouts Mandela Effect';
# ref: https://www.xkcd.com/2241/
Giving the DB server name to the webserver

To sum up, we have three ways to connect to a database in P19:

  • Using the server’s IP
  • Using the server’s name
  • Using the database’s name

The blocker

Following the rules described here, since we moved your databases individually rather than a whole server at a time, and since databases were shuffled across all our new instances rather than staying with their neighbours, we could not reuse the server’s IP address. Nor could we reuse the server’s name.

We seriously thought about automatically changing all our customers’ configurations.

But this meant we would have had to:

  • Parse hundreds of terabytes. Doable, but time-consuming.
  • Understand your website’s organisation. This can be automated for up to 99% of all the different use cases, but 1% of 1.5 million websites is still 15,000 potentially broken websites following the replacement.
  • Replace the IP or server name with the database name. Possible, but not 100% reliable.

What if the customer:

  • Uses its database from within a binary file, like a compiled CGI written in C? Not parsable.
  • Uses a different database from the one we have linked to the website? How would we know which one should be used?
  • Does not use its database at all? How would we know that finding nothing was actually the correct result?

As this was clearly not an option, we had to find a way to support every use case.

If the configuration was:

  • mydatabase.mysql.db: No problem. The DNS was up to date.
  • servername: The query would reach the server and need to be forwarded.
  • server IP: The query would reach the server and need to be forwarded.

The solution

Maybe some of you have already started to figure out the solution…

We needed a piece of software to masquerade as the server, while actually redirecting incoming connections to the real server behind it. In other words, a proxy!

We decided to use René Cannaò’s ProxySQL.

ProxySQL is more or less designed for a small number of users querying a small number of databases, mainly to split read-only and read-write access across different backends, which can greatly help a database with few writes and a lot of reads handle the load. It can also help with parsing and securing queries. Using it for our purposes pushed it to its limits. We must say: this product rocks!

Paris side

First of all, we had to deploy it on hundreds of servers in P19.

Our architecture allowed us to install one ProxySQL on each database host. This was done between the 27th of November and the 6th of December 2019 (http://travaux.ovh.net/?do=details&id=35499).

Using ProxySQL

Once all of them were up and running, we could start the migration. For every database, its ProxySQL was updated so that it would track where the database really was: in Paris or in Gravelines.

Some DB in Paris, some in Gravelines
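As an illustration, here is a hedged sketch of what such an update could look like through ProxySQL’s admin interface, which speaks the MySQL protocol (by default on 127.0.0.1:6032). The credentials, backend hostname and hostgroup number below are made up for the example; they are not our actual values.

# Hypothetical sketch: repoint one migrated database on a local ProxySQL instance.
# ProxySQL's admin interface speaks the MySQL protocol, by default on 127.0.0.1:6032.
$admin = mysqli_connect('127.0.0.1', 'admin', 'admin', '', 6032);

# Point the hostgroup serving 'mydatabase' to its new Gravelines backend
# (hostname and hostgroup number are made up for the example)...
mysqli_query($admin, "UPDATE mysql_servers SET hostname = 'gra-mysql-12.example' WHERE hostgroup_id = 42");

# ...then apply the change to the running configuration and persist it to disk.
mysqli_query($admin, 'LOAD MYSQL SERVERS TO RUNTIME');
mysqli_query($admin, 'SAVE MYSQL SERVERS TO DISK');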

Easy-peasy, isn’t it? As the ProxySQL runs on the same host, we didn’t need to fake anything, and it worked like a charm. But what would happen if a webserver in Gravelines called a database that was still in Paris?

DNS errors on calls from Gravelines to Paris

Gravelines side

This time, we had no choice but to fake the server name in Gravelines. What could we use? A ProxySQL, of course! But with our new architecture, we just could not install one on every host because, as seen previously, we have 10 times fewer customers per instance in Gravelines than in Paris.

So we had to set up a ProxySQL farm and a lying DNS that resolves queries about Paris servers to a member of that farm. Using the same mechanism to update the ProxySQL configuration during migration, we finally had this kind of infrastructure:

Working architecture with a bi-directional network tunnel and ProxySQLs on both sides
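To picture the lying DNS, here is a tiny, purely illustrative sketch of what name resolution looks like from a Gravelines webserver; the resolved address is made up for the example.

# Illustrative only: from a Gravelines webserver, the legacy P19 server name
# no longer resolves to the old Paris host but to a member of the ProxySQL farm,
# which then forwards the connection to wherever the database really lives.
echo gethostbyname('mysql55-59.pro');   # e.g. '10.8.0.17', a ProxySQL farm member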

But what about configurations with IP addresses in them? As each ProxySQL was loaded with the users of a group of P19 servers, a routing trick allowed us to make the webservers believe that the P19 IPs were set up on a ProxySQL!

Dealing with DB defined by IP

The downside

You just learned that, to allow the migration to happen in the required timeframe, we had to introduce ProxySQL. Even though it is a great piece of software, and was almost a life-saver for us, as sysadmins we prefer to avoid maintaining more and more different solutions on a daily basis. This is why we have decided to drop support, in the months to come, for ways of connecting to your database other than the .mysql.db name. More details will come soon.

If you want to retain our state-of-the-art way of doing things (and also give us a hand with keeping the changes smooth!), you can check that your website already uses .mysql.db in your config files.
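If you want a quick way to check, here is a small, purely illustrative sketch that tests whether the host configured for your website is already in the <database>.mysql.db form; adapt the variable to the value in your own configuration file.

# Illustrative check: is the configured host already the recommended <name>.mysql.db form?
$database_server = 'mydatabase.mysql.db';   # value taken from your own configuration

if (preg_match('/\.mysql\.db$/', $database_server)) {
    echo "OK: this website already uses the database name.\n";
} else {
    echo "Legacy server name or IP detected: please switch to the <database>.mysql.db form.\n";
}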

Conclusion

Having to support multiple ways of connecting to a database gave us loads of headaches. Finding a great piece of software was our aspirin, and it allowed us to enable the --quiet flag in our migration process, so customers did not have to change their configurations to keep their websites online.

However, you may have noticed it did not go quite as smoothly as we wanted. We encountered many small, sometimes unexpected, problems that we had to fix ASAP…

  • You may remember the abnormal number of “max_user_connections” errors (http://travaux.ovh.net/?do=details&id=35689).
  • We had to play with ProxySQL’s source code to enable its use against MySQL 5.1 combined with the pre-MySQL 4.0 password hashing algorithm.
  • We found some funny encoding issues in the source code, which we patched upstream.
  • We were able to flood our hosts’ ARP tables multiple times per hour.

You will learn more about these challenges in our upcoming posts!

DevOps at OVHcloud

Member of the database team since 2017