How to win at the massive database migration game

In our previous articles, we explained why we had to move 800,000 databases from one datacentre to another, 300 kilometres away. So here we are… My team and I did it! It was a real brain-burner, so I hope our story will help you to address more of the huge technical projects we love to play with.

The rules

To reduce latency, a database must be migrated at the same time as the website using it.
Because databases are spread across all available MySQL servers, the granularity of a migration has to be the database, not the MySQL instance. In other words, we cannot migrate a whole MySQL server. We must only move a part of it.
As the link between a website and its database is not necessarily referenced, a website in Gravelines should be able to contact a database in Paris (for example), and vice-versa.
To contact its database, a website uses a host name, a user name and a password. We want the migration to be transparent, so nobody should have to change any of those elements to contact their new database.
Databases’ platforms change between Paris and Gravelines, as shown below.

To sum up, this is what we have before a web cluster migration:

And this is what we want after the migration:

A few more things…

Obviously, we had to keep in mind one of the most important thing when dealing with databases: consistency. For each database, we had to define a consistency point. Before this point on the timeline, read/writes were made in Paris. After it, read/writes were made in Gravelines.
We believe in transparency and reversability. These are both key parts of our SMART cloud. That’s why we wanted to give you access to this consistency point, as a dump on your OVHcloud Control Panel. For every database migrated, we chose to give you access to the dump for one month.
Migrating 800K databases in about 60 nights meant we had to be very fast and scalable. Our record was on the 1st of July 2019, when we successfully migrating 13,502 databases in 1 hour, 13 minutes and 31 seconds.
If you’re used to being on duty, you know that your attention and efficiency are lower during the night. Repeating the migration process about 60 times in one year would amplify that, so we wanted everything to be as automated and as simple as possible. As we will see later, to launch the database migrations, we just had to run one command on a single host:

migrate-p19

Now you know rules, it’s time to start the game!

Level 1

The first level is always an easy one, where you discover how the game works through a kind of tutorial! So, let’s begin with a small database migration. This is how we do it:

1. At the source (Paris)

Set read-only mode. We absolutely need to avoid writes during the migration, to avoid the famous split-brain. The easiest way to do this is to put the database in read-only mode. In most cases, websites need only to read the databases, but in some cases, they need read and writes, and they will therefore be broken. That’s not a problem, because the website is currently migrated and is closed. We’ll block write access, in case the database is used by another host that isn’t affected by the nighttime migration.
Dump the database and put the dump somewhere. We choose to store the dumps on OVHcloud’s Public Cloud Storage (PCS), as we already use this solution to store 36 million dumps per month. Adding 800,000 dumps in one year is no problem for this awesome platform!

2. At the destination (Gravelines)

Retrieve the dump and import it.
Create the user and permissions, with write access.

3. Switch to the new database

At this point, the website is still calling the database at Paris. So the website (whether it’s hosted in Paris or Gravelines) can contact the new database, we’ll update the DNS so that the name points to the Gravelines MySQL instance rather than the Paris one.
The read access to the Paris database is also removed.
Finally, we’ll update our information system, so you can retrieve the dump from PCS via your Control Panel. This update also allows us to redirect all actions available from the Control Panel (e.g. change a password, create a dump…) to the new database at Gravelines.

Level 2: “Decentralised state machine”

To deliver the proof of concept for the migration, we first ran all those steps manually and sequentially. The natural way to automate this is to write a script to do the same thing, but faster. That’s a centralised method, but such methods experience bottlenecks sooner or later, and imply a single point of failure.

To prevent this, and fulfil our scalability requirements, we have to be decentralised. We must imagine the migration of a single database as a state machine. Here is a simplified version of the state graph of a database migration, as described above:

Using this state machine, we can perform those three big steps by different machines to parallelise the workload:

The source
The destination
The one updating the DNS

These three hosts can perform their tasks in an independent and decentralised way. All they have to do is watch the state graph to see if they have something to do, and if so, update it and perform the tasks.

The brain of the migration: CloudDB

We love the ‘eat your own food’ concept! That’s the best quality control ever, and with the feedback you give us, our first source of feature requests. So it’s no surprise that we used our own CloudDB product to store the state graphs of the databases migrations.

Technically, a state graph is a row in a table. The simplified structure of this table looks like this:

- database_name VARCHAR(255) PRIMARY KEY,
- source VARCHAR(255),
- destination VARCHAR(255),
- status VARCHAR(255) NOT NULL DEFAULT 'Waiting',
- dump_url TEXT

Except for the dump_url, all the fields are filled before the migration begins. In other words, we know where the databases are, and where they will be.

We’ve beaten all the challenges of this level. Now it’s time to beat the final monster!

Level 3: Migrate 800K databases

Now that we know how to migrate a single database in a decentralised way, let’s fill the CloudDB with all the databases we want to migrate! This is what the migration now looks like:

In Paris

About once a minute*, each host of the 780 database servers asks the CloudDB if they have something to dump. The source and status columns of the table are used to retrieve this information:

SELECT … WHERE source = me AND status = 'To dump';

If so, they execute their tasks and update the CloudDB about what they’re doing. When they’re done, they pass the baton for this migration to Gravelines:

UPDATE … SET status = 'To import' WHERE database_name = '…';

In Gravelines

At the same time, 300 kilometres away, hundreds of database servers are also asking the CloudDB if they have something to import. As in Paris, they query the CloudDB about once a minute*. The destination and status columns of the table are used to retrieve this information:

SELECT … WHERE destination = me AND status = 'To import';

If so, they execute their tasks and update the CloudDB about what they’re doing. When they’re done, they pass the baton to a third robot, who will change the DNS records for this database migration:

UPDATE … SET status = 'DNS to update' WHERE database_name = '…';

(*) To avoid flooding the CloudDB, we use a random frequency to query the database with the state graphs. This way, the connections are globally distributed over time.

DNS update

The robot responsible for the DNS update is the third player in the migration process, and works in the same way as the dump and import robots described above.

Not that easy…

Of course, the actual game was more complex. This was a simplified version of the migration, with some steps missing or not detailed enough, such as:

Preventing writes to the source database
Updating the IS (among others) so you can see the dump in the Control Panel
Setting the password on the destination (the same as the one on the source) without knowing it
And many others

But now that you’ve got the concept for the main steps, you can imagine how we handled the others.

Cheat code: Iterate!

Do you know the law of truly large numbers? It states that with a large enough number of samples, any unlikely thing is likely to be observed.

This is one of the first lessons you learn when you host 1.2 million databases. Every day, we face the many improbable things that can happen with databases, so we knew that, despite the tests we conducted, we would encounter difficulties, weird cases, and improbable bottlenecks.

But there’s a cheat code to beat that boss: iterate!

Begin the migration
Face a problem
Fix it definitively (not just for the particular case that failed, but also for all similar cases across the whole platform)
Then try again, faster!

This method is possible thanks to two things:

The magic command
The big red button

The magic command

As mentioned above, to launch the database migrations, we had to run one command on a single host:

migrate-p19

This magic command has one parameter: the number of parallel migrations you want to do. We used 400 for this parameter.

migrate-p19 --max-procs 400

This means that 400 databases are being dumped or imported at the same time – no more, no less.

The migrate-p19 command is a scheduler. It updates the CloudDB every 10 seconds, so we always have those 400 migrations performed in parallel:

SELECT COUNT(database_name) … WHERE status in ('To dump', 'Dumping', 'Dump failed', 'To import', …);
42
UPDATE … SET status = 'To dump' WHERE status = 'Waiting' LIMIT (400 - 42);

Pause the game: The big red button

In every machine, it is mandatory to have a big red button to press when something’s wrong. To interrupt the migration, for a reason or another, we just have to kill the migration-p19 script. When we do this, the ongoing migrations terminate themselves, after which no new ones are launched.

The interruption is automatic. If something’s really bad, the scheduler can see that there are too many operations in an error state, and then decide to stop the migration.

Our database migration adventure will continue on this blog. Indeed, the world is not perfect, and we’ve had to handle some interesting cases…

To be continued…

Want to be part of the team bringing this migration to life, and keeping this huge platform in an operational state? We are hiring an open-source SRE and automation addict in France and Canada!

Mikaël Davranche

+ posts