<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Swift Archives - OVHcloud Blog</title>
	<atom:link href="https://blog.ovhcloud.com/tag/swift/feed/" rel="self" type="application/rss+xml" />
	<link>https://blog.ovhcloud.com/tag/swift/</link>
	<description>Innovation for Freedom</description>
	<lastBuildDate>Tue, 21 Jan 2020 16:02:32 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://blog.ovhcloud.com/wp-content/uploads/2019/07/cropped-cropped-nouveau-logo-ovh-rebranding-32x32.gif</url>
	<title>Swift Archives - OVHcloud Blog</title>
	<link>https://blog.ovhcloud.com/tag/swift/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Dealing with small files with OpenStack Swift (part 2)</title>
		<link>https://blog.ovhcloud.com/dealing-with-small-files-with-openstack-swift-part-2/</link>
		
		<dc:creator><![CDATA[Alexandre Lecuyer]]></dc:creator>
		<pubDate>Fri, 17 Jan 2020 09:16:09 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[OpenStack]]></category>
		<category><![CDATA[Storage]]></category>
		<category><![CDATA[Swift]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=16151</guid>

					<description><![CDATA[In the first part of these articles, we demonstrated how storing small files in Swift may cause performance issues. In this second part we will present the solution. With this in mind, I will assume that you have read the first part, or that you are familiar with Swift. Files inside files We settled on [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>In the <a href="https://www.ovh.com/blog/dealing-with-small-files-with-openstack-swift-part-1/" data-wpel-link="exclude">first part of this series</a>, we demonstrated how storing small files in Swift may cause performance issues. In this second part, we will present the solution. I will assume that you have read the first part, or that you are familiar with Swift.</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img fetchpriority="high" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/E53A8B26-BDF1-43A9-9A64-5C38029C9929-1024x513.jpeg" alt="Swift at OVHcloud" class="wp-image-16063" width="512" height="257" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/E53A8B26-BDF1-43A9-9A64-5C38029C9929-1024x513.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/E53A8B26-BDF1-43A9-9A64-5C38029C9929-300x150.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/E53A8B26-BDF1-43A9-9A64-5C38029C9929-768x385.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/E53A8B26-BDF1-43A9-9A64-5C38029C9929.jpeg 1199w" sizes="(max-width: 512px) 100vw, 512px" /></figure></div>



<h3 class="wp-block-heading">Files inside files</h3>



<p>We settled on a simple approach: we would store all these small fragments in larger files. This means that inode usage on the filesystem is much lower.</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/01/E65EF82E-7465-4791-AF4D-23E67D0DC1E6-1024x273.jpeg" alt="" class="wp-image-16757" width="512" height="137" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/01/E65EF82E-7465-4791-AF4D-23E67D0DC1E6-1024x273.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2020/01/E65EF82E-7465-4791-AF4D-23E67D0DC1E6-300x80.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/01/E65EF82E-7465-4791-AF4D-23E67D0DC1E6-768x205.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2020/01/E65EF82E-7465-4791-AF4D-23E67D0DC1E6.jpeg 1220w" sizes="(max-width: 512px) 100vw, 512px" /></figure></div>



<p>These large files, which we call “volumes”, have three important characteristics:</p>



<ul class="wp-block-list"><li>They are dedicated to a Swift partition</li><li>They are append only: data is never overwritten</li><li>There are no concurrent writes to a volume: to allow parallelism, we may have multiple volumes per partition</li></ul>



<p>We need to keep track of the location of these fragments within a volume. For this we developed a new component: the index-server. It stores each fragment&#8217;s location: the volume it is stored in, and its offset within that volume.</p>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/01/4147A20D-7993-4079-903C-A07FC94BEA81.jpeg" alt="" class="wp-image-16759" width="398" height="293" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/01/4147A20D-7993-4079-903C-A07FC94BEA81.jpeg 795w, https://blog.ovhcloud.com/wp-content/uploads/2020/01/4147A20D-7993-4079-903C-A07FC94BEA81-300x221.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2020/01/4147A20D-7993-4079-903C-A07FC94BEA81-768x565.jpeg 768w" sizes="(max-width: 398px) 100vw, 398px" /></figure></div>



<p>There is one index-server per disk drive, which means that its failure domain matches that of the data it indexes. It communicates with the existing object-server process through a local UNIX socket.</p>



<h4 class="wp-block-heading">Leveraging LevelDB</h4>



<p>We chose LevelDB to store fragment locations in the index-server:</p>



<ul class="wp-block-list"><li>It sorts data on-disk, which means it is efficient on regular spinning disks</li><li>It is space efficient thanks to the Snappy compression library</li></ul>



<p>Our initial tests were promising: they showed that we needed about 40 bytes to track a fragment, versus 300 bytes with the regular file-based storage backend. We only track the fragment location, while the filesystem stores a lot of information that we don&#8217;t need (user, group, permissions&#8230;). This means the key-value store would be small enough to be cached in memory, and listing files would not require physical disk reads.</p>
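<p>For illustration, the order of magnitude is easy to see with a fixed-size encoding of a location. This is a hypothetical layout, not the actual LOSF key-value format:</p>

```python
import struct

# Hypothetical compact encoding of a fragment location: a 4-byte volume
# index plus an 8-byte byte offset within that volume.
LOCATION = struct.Struct(">IQ")  # volume_index (uint32), offset (uint64)

def pack_location(volume_index, offset):
    return LOCATION.pack(volume_index, offset)

def unpack_location(blob):
    return LOCATION.unpack(blob)

encoded = pack_location(7, 1_048_576)
print(len(encoded))             # -> 12 bytes for the value
print(unpack_location(encoded)) # -> (7, 1048576)
```

<p>Twelve bytes for the value, plus the key (the object hash) and LevelDB&#8217;s own overhead, sits comfortably within the roughly 40 bytes measured above.</p>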



<p>When writing an object, the regular Swift backend would create a file to hold the data. With LOSF, it would instead:</p>



<ul class="wp-block-list"><li>Obtain a filesystem lock on a volume</li><li>Append the object data at the end of the volume, and call fdatasync()</li><li>Register the object location in the index-server (volume number, and offset within volume)</li></ul>
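<p>The three steps above can be sketched in a few lines. This is a minimal, single-process illustration, with a plain dict standing in for the index-server (which the real system reaches over a local UNIX socket):</p>

```python
import fcntl
import os

def put_fragment(volume_path, name, data, index):
    """Append a fragment to a volume, then record its location."""
    fd = os.open(volume_path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)         # one writer per volume
        offset = os.lseek(fd, 0, os.SEEK_END)  # the fragment starts here
        os.write(fd, data)
        os.fdatasync(fd)                       # durable before we index it
    finally:
        os.close(fd)                           # also releases the lock
    index[name] = (volume_path, offset, len(data))
    return offset
```

<p>Registering the location only after <code>fdatasync()</code> returns means the index never points at data that could be lost in a crash.</p>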



<p>To read back an object:</p>



<ul class="wp-block-list"><li>Query the index-server to get its location: volume number and offset</li><li>Open the volume and seek to the offset to serve the data</li></ul>
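<p>The read path is symmetrical. Again, a dict mapping object names to <code>(volume_path, offset, length)</code> tuples stands in for the index-server query:</p>

```python
import os

def get_fragment(index, name):
    """Serve an object back from its volume."""
    volume_path, offset, length = index[name]
    with open(volume_path, "rb") as volume:
        volume.seek(offset)        # jump straight to the fragment
        return volume.read(length)
```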



<p>However, we still have a couple of problems to solve!<br></p>



<h3 class="wp-block-heading">Deleting objects</h3>



<p>When a customer deletes an object, how can we actually delete data from the volumes? Remember that we only append data to a volume, so we can&#8217;t just mark space as unused within a volume and try to reuse it later. We use XFS, and it offers an interesting solution: the ability to “punch a hole” in a file.</p>



<p>The logical size does not change, which means that fragments located after the hole do not change offset. The physical space, however, is released to the filesystem. This is a great solution, as it means we can keep appending to volumes, free space within a volume, and let the filesystem deal with space allocation.</p>
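<p>On Linux, hole punching is exposed through <code>fallocate(2)</code> with the <code>FALLOC_FL_PUNCH_HOLE</code> flag. Python has no direct wrapper, so an illustrative sketch (64-bit Linux only) has to go through <code>ctypes</code>. Note that <code>FALLOC_FL_KEEP_SIZE</code> is what preserves the logical size, and with it the offsets of the following fragments:</p>

```python
import ctypes
import ctypes.util
import os

# fallocate(2) mode flags, from linux/falloc.h.
FALLOC_FL_KEEP_SIZE = 0x01   # do not change the file's logical size
FALLOC_FL_PUNCH_HOLE = 0x02  # release the underlying physical blocks

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                           ctypes.c_long, ctypes.c_long]

def punch_hole(fd, offset, length):
    """Free the physical space backing [offset, offset + length)."""
    mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE
    if libc.fallocate(fd, mode, offset, length) != 0:
        errno = ctypes.get_errno()
        raise OSError(errno, os.strerror(errno))
```

<p>After a successful punch, reads from the range return zeros, and the blocks are handed back to the filesystem.</p>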



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/01/643F85AA-C543-4F90-BD10-D6F34D41287F.jpeg" alt="" class="wp-image-16762" width="377" height="238" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/01/643F85AA-C543-4F90-BD10-D6F34D41287F.jpeg 754w, https://blog.ovhcloud.com/wp-content/uploads/2020/01/643F85AA-C543-4F90-BD10-D6F34D41287F-300x189.jpeg 300w" sizes="auto, (max-width: 377px) 100vw, 377px" /></figure></div>



<div class="wp-block-image"><figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" src="https://www.ovh.com/blog/wp-content/uploads/2020/01/314F81C4-F129-48AE-93A0-B707D2C021CE.jpeg" alt="" class="wp-image-16763" width="375" height="359" srcset="https://blog.ovhcloud.com/wp-content/uploads/2020/01/314F81C4-F129-48AE-93A0-B707D2C021CE.jpeg 749w, https://blog.ovhcloud.com/wp-content/uploads/2020/01/314F81C4-F129-48AE-93A0-B707D2C021CE-300x288.jpeg 300w" sizes="auto, (max-width: 375px) 100vw, 375px" /></figure></div>



<h2 class="wp-block-heading">Directory structure</h2>



<p>The index-server will store object names in a flat namespace, but Swift relies on a directory hierarchy.</p>



<pre class="wp-block-code"><code class="">/mnt/objects/&lt;partition&gt;/&lt;suffix&gt;/&lt;checksum&gt;/&lt;timestamp&gt;.data</code></pre>



<p>The partition directory is the partition which the object belongs to, and the suffix directory is just the last three hex characters of the MD5 checksum. (This was done to avoid having too many entries in a single directory.)</p>



<p>If you have not used Swift before, the &#8220;partition index&#8221; of an object indicates which device in the cluster should store the object. The partition index is calculated by taking a few bits from the object&#8217;s path MD5. You can find out more <a href="https://docs.openstack.org/swift/latest/overview_ring.html" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">here</a>.<br></p>



<p>We do not explicitly store these directories in the index-server, as they can be computed from the object hash. Remember that object names are stored in order by LevelDB.</p>
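<p>As a rough sketch of that computation (the real Swift hash also mixes per-cluster prefix and suffix secrets into the MD5, and the part power and timestamp here are made-up examples):</p>

```python
import hashlib

def legacy_disk_path(object_path, part_power=18,
                     timestamp="1579252569.12345"):
    """Rebuild the /mnt/objects/... path from an object's name alone."""
    checksum = hashlib.md5(object_path.encode()).hexdigest()
    partition = int(checksum, 16) >> (128 - part_power)  # top bits
    suffix = checksum[-3:]                               # last 3 hex chars
    return f"/mnt/objects/{partition}/{suffix}/{checksum}/{timestamp}.data"

print(legacy_disk_path("/AUTH_demo/container/object"))
```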



<figure class="wp-block-image is-resized"><img loading="lazy" decoding="async" src="https://docs.google.com/drawings/d/sXCQOPWYjWdZxRKyoCQCBMg/image?w=602&amp;h=292&amp;rev=263&amp;ac=1&amp;parent=1UwtIisZKQcstTlIOKYX6I17U7bH4mCtg6uIC7xcY2mM" alt="" width="809" height="392"/></figure>



<h3 class="wp-block-heading">Data migration</h3>



<p>This new approach changes the on-disk format. However, we already had over 50 PB of data, so migrating offline was impossible. Instead, we wrote an intermediate, hybrid version of the system. It would always write new data using the new disk layout, but for reads, it would first look in the index-server and, if the fragment wasn&#8217;t found, fall back to the regular backend directory.</p>
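<p>The hybrid read path can be sketched like this, with <code>index</code> and <code>legacy_path_for</code> standing in for the real index-server query and the regular backend&#8217;s path computation:</p>

```python
def hybrid_get(name, index, legacy_path_for):
    """Migration-era read path: new layout first, old layout as fallback."""
    location = index.get(name)
    if location is not None:
        volume_path, offset, length = location
        with open(volume_path, "rb") as volume:
            volume.seek(offset)
            return volume.read(length)
    # Not migrated yet: read the old one-file-per-object layout.
    with open(legacy_path_for(name), "rb") as legacy:
        return legacy.read()
```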



<p>Meanwhile, a background tool ran to slowly convert data from the old system to the new one. This took a few months to run across all machines.</p>



<h3 class="wp-block-heading">Results</h3>



<p>After the migration completed, the disk activity on the cluster was much lower: we observed that the index-server data would fit in memory, so listing objects in the cluster, or getting an object&#8217;s location would not require physical disk IO. The latency improved for both PUT and GET operations, and the cluster &#8220;reconstructor&#8221; tasks were able to progress much faster. (The reconstructor is the process that rebuilds data after a disk fails in the cluster)</p>



<h3 class="wp-block-heading">Future work</h3>



<p>In the context of object storage, hard drives still have a price advantage over solid state drives. Their capacity continues to grow, but per-drive performance has not improved. For the same usable space, if you switch from 6TB to 12TB drives, you&#8217;ve effectively halved your performance.</p>



<p>As we plan for the next generation of Swift clusters, we must find new ways to use these larger drives while ensuring performance is still good. This will probably mean using a mix of SSDs and spinning disks. Exciting work is happening in this area, as we will store more data on dedicated fast devices to further optimise Swift&#8217;s cluster response time.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Dealing with small files with OpenStack Swift (part 1)</title>
		<link>https://blog.ovhcloud.com/dealing-with-small-files-with-openstack-swift-part-1/</link>
		
		<dc:creator><![CDATA[Alexandre Lecuyer]]></dc:creator>
		<pubDate>Fri, 04 Oct 2019 08:37:39 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[OpenStack]]></category>
		<category><![CDATA[Storage]]></category>
		<category><![CDATA[Swift]]></category>
		<guid isPermaLink="false">https://www.ovh.com/blog/?p=16037</guid>

					<description><![CDATA[OpenStack Swift is a distributed storage system that is easy to scale horizontally, using standard servers and disks. We are using it at OVHcloud for internal needs, and as a service for our customers. By design, it is rather easy to use, but you still need to think about your workload when designing a Swift [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>OpenStack Swift is a distributed storage system that is easy to scale horizontally, using standard servers and disks.</p>



<p>We are using it at OVHcloud for internal needs, and as a service for our customers.<br></p>



<div class="wp-block-image"><figure class="aligncenter"><img loading="lazy" decoding="async" width="1024" height="513" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/E53A8B26-BDF1-43A9-9A64-5C38029C9929-1024x513.jpeg" alt="Swift at OVHcloud" class="wp-image-16063" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/E53A8B26-BDF1-43A9-9A64-5C38029C9929-1024x513.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/E53A8B26-BDF1-43A9-9A64-5C38029C9929-300x150.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/E53A8B26-BDF1-43A9-9A64-5C38029C9929-768x385.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/E53A8B26-BDF1-43A9-9A64-5C38029C9929.jpeg 1199w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure></div>



<p>By design, it is rather easy to use, but you still need to think about your workload when designing a Swift cluster. In this post I’ll explain how data is stored in a Swift cluster, and why small objects are a concern.<br></p>



<h3 class="wp-block-heading">How does Swift store files?</h3>



<p>The nodes responsible for storing data in a Swift cluster are the “object servers”. To select the object servers that will hold a specific object, Swift relies on <a href="https://en.wikipedia.org/wiki/Consistent_hashing" data-wpel-link="external" target="_blank" rel="nofollow external noopener noreferrer">consistent hashing</a>.</p>






<p>In practice, when an object is uploaded, an MD5 checksum is computed, based on the object name. A number of bits are extracted from that checksum, which gives us the “partition” number.</p>



<p>The partition number enables you to look at the “ring”, to see which server and disk should store that particular object. The “ring” is a mapping between a partition number, and the object servers that should store objects belonging to that partition.<br></p>



<p>Let’s take a look at an example. In this case we will use only 2 bits of the MD5 checksum (far too few in practice, but much easier to draw: there are only 4 partitions).</p>



<figure class="wp-block-image"><img loading="lazy" decoding="async" width="1024" height="564" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/00FE1332-E324-4E49-9E6E-0FA7E6C27E81-1024x564.jpeg" alt="" class="wp-image-16064" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/00FE1332-E324-4E49-9E6E-0FA7E6C27E81-1024x564.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/00FE1332-E324-4E49-9E6E-0FA7E6C27E81-300x165.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/00FE1332-E324-4E49-9E6E-0FA7E6C27E81-768x423.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/00FE1332-E324-4E49-9E6E-0FA7E6C27E81.jpeg 1227w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>






<p>When a file is uploaded, from its name and other elements, we get an MD5 checksum, <code>72acded3acd45e4c8b6ed680854b8ab1</code>. If we take the 2 most significant bits, we get partition 1.</p>
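<p>You can check this worked example in a couple of lines:</p>

```python
checksum = "72acded3acd45e4c8b6ed680854b8ab1"
part_power = 2  # 2 bits -> 4 partitions, as in the diagram

# Keep only the top `part_power` bits of the 128-bit checksum.
partition = int(checksum, 16) >> (128 - part_power)
print(partition)  # -> 1
```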



<p>From the object ring, we get the list of servers that should store copies of the object.&nbsp;</p>



<p>With a recommended Swift setup, you would store three identical copies of the object. For a single upload, we create three actual files, on three different servers.<br></p>



<h3 class="wp-block-heading">Swift policies</h3>



<p>We&#8217;ve just seen that the most common Swift policy is to store identical copies of an object.</p>



<p>That may be a little costly for some use cases, and Swift also supports “erasure coding” policies.<br></p>



<p>Let’s compare them now.<br></p>



<p>First, the “replica” policy, which we just described: you choose how many copies of each object to store.</p>



<figure class="wp-block-image"><img loading="lazy" decoding="async" width="1024" height="885" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/407D94D3-DCA9-4F2B-B34B-82FBD8D89186-1024x885.jpeg" alt="Replica in Swift" class="wp-image-16071" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/407D94D3-DCA9-4F2B-B34B-82FBD8D89186-1024x885.jpeg 1024w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/407D94D3-DCA9-4F2B-B34B-82FBD8D89186-300x259.jpeg 300w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/407D94D3-DCA9-4F2B-B34B-82FBD8D89186-768x664.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/407D94D3-DCA9-4F2B-B34B-82FBD8D89186.jpeg 1290w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Next, the “erasure coding” policy type:</p>



<figure class="wp-block-image"><img loading="lazy" decoding="async" width="918" height="1024" src="https://www.ovh.com/blog/wp-content/uploads/2019/10/14C000A4-B7C0-4C80-B394-D9884C8765A6-918x1024.jpeg" alt="Erasure coding in Swift" class="wp-image-16072" srcset="https://blog.ovhcloud.com/wp-content/uploads/2019/10/14C000A4-B7C0-4C80-B394-D9884C8765A6-918x1024.jpeg 918w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/14C000A4-B7C0-4C80-B394-D9884C8765A6-269x300.jpeg 269w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/14C000A4-B7C0-4C80-B394-D9884C8765A6-768x856.jpeg 768w, https://blog.ovhcloud.com/wp-content/uploads/2019/10/14C000A4-B7C0-4C80-B394-D9884C8765A6.jpeg 1269w" sizes="auto, (max-width: 918px) 100vw, 918px" /></figure>



<p>The object is split into fragments, with added redundant pieces to enable object reconstruction, if a disk containing a fragment fails.</p>



<p>At OVHcloud, we use a 12+3 policy (12 data fragments from the object, plus 3 computed parity fragments).</p>



<p>This mode is more space efficient than replication, but it also creates more files on the infrastructure. In our configuration, we would create 15 files on the infrastructure, vs 3 files with a standard “replication” policy.</p>
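<p>The trade-off is easy to quantify with a simple model that ignores metadata overhead:</p>

```python
def raw_per_user_byte(fragments_total, fragments_data):
    """Bytes written to disk for each byte of user data."""
    return fragments_total / fragments_data

# 3-replica policy: 3 whole copies of the object, 3 files on disk.
replication = raw_per_user_byte(3, 1)
# 12+3 erasure coding: 15 fragments on disk, of which 12 carry the data.
erasure = raw_per_user_byte(15, 12)

print(replication, erasure)  # -> 3.0 1.25
```

<p>So erasure coding consumes far less raw space per object, at the cost of five times as many files.</p>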



<h3 class="wp-block-heading">Why is this a problem?</h3>



<p>On clusters where we have a combination of an erasure coding policy and a median object size of 32&nbsp;KB, we would end up with over 70 million files <em>per drive</em>.</p>



<p>On a server with 36 disks, that’s 2.5 billion files.<br></p>
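<p>That figure is straightforward to check:</p>

```python
files_per_drive = 70_000_000  # from the erasure-coded clusters above
drives_per_server = 36

files_per_server = files_per_drive * drives_per_server
print(f"{files_per_server:,}")  # -> 2,520,000,000 (~2.5 billion)
```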



<p>The Swift cluster needs to regularly list these files to:</p>



<ul class="wp-block-list"><li>Serve the object to customers</li><li>Detect bit rot</li><li>Reconstruct an object if a fragment has been lost because of a disk failure</li></ul>



<p>Usually, listing files on a hard drive is pretty quick, thanks to Linux&#8217;s directory cache. However, on some clusters we noticed that the time to list files was increasing, and that a lot of the hard drives&#8217; IO capacity was used to read directory contents: there were too many files, and the system was unable to cache the directory contents. Wasting a lot of IO on this meant that the cluster response time was getting slower, and reconstruction tasks (rebuilding fragments lost because of disk failures) were lagging.</p>



<p>In the next post we&#8217;ll see how we addressed this.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
