Hello everyone,

As many of you have noticed, since March we have been encountering various problems on the Swift infrastructure that stores Hubic data. From your point of view, these problems boil down to performance and stability issues. I'll try to explain in a bit more detail what is going on.

The problems appeared when we decided to migrate our storage to erasure coding. The goal of this migration is to keep offering the service in its current form, without changing the quota on free accounts or the pricing on paid accounts.

One of the problems we are encountering today concerns the file system we use. The file system is the intermediate layer that allows the operating system to talk to the hard drive. We use XFS. On each server that stores your data, there are several hard disks, each formatted with XFS.

When we used the replica method (x3), storing one of your 1MB files meant storing three 1MB files on three hard drives. Since moving to erasure coding with 12 + 3 fragments, each of your 1MB files corresponds for us to 15 files of about 83KB each, plus 15 ".durable" files whose role is to mark the "commit". Per object, our cluster therefore goes from:
- 3 files to 30 files
- 3MB to 1.25MB

Of course, this saves space, but it also multiplies the number of files by 10. Said like that, it may not seem like much, but at Hubic's scale, with billions of files, it has real consequences.
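To make the arithmetic above concrete, here is a minimal sketch of the per-object accounting under the 12 + 3 scheme described above (the function names are illustrative, not Swift internals):

```python
# Minimal sketch of the per-object arithmetic described above.
# Helper names are illustrative; this is not Swift code.

def replication_footprint(object_size_mb, replicas=3):
    """x3 replication: each replica is a full-size copy of the object."""
    return {"files": replicas, "raw_mb": replicas * object_size_mb}

def erasure_coding_footprint(object_size_mb, data=12, parity=3):
    """EC 12+3: the object is split into 12 data fragments, 3 parity
    fragments are added (each ~object_size/12, i.e. ~83KB for 1MB),
    and each fragment gets a companion ".durable" commit-marker file."""
    fragment_files = data + parity                         # 15 fragment files
    return {
        "files": fragment_files * 2,                       # 15 fragments + 15 .durable markers
        "raw_mb": object_size_mb * fragment_files / data,  # 1MB -> 1.25MB on disk
    }

print(replication_footprint(1.0))     # {'files': 3, 'raw_mb': 3.0}
print(erasure_coding_footprint(1.0))  # {'files': 30, 'raw_mb': 1.25}
```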

The main one is simply an explosion in the size of the inode table. In the XFS file system (like most other file systems), a file is represented by what is called an inode. To simplify, an inode is an entry in a database that stores the file's metadata.
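As a tiny illustration of what an inode holds, on any POSIX system stat() reads that metadata back (a minimal sketch; the path is just an example):

```python
# The metadata described above (size, timestamps, owner, link count...)
# lives in the inode; stat() reads it back. The file name itself is
# stored in the directory entry, not in the inode.
import os

st = os.stat("/etc/hostname")    # any existing file will do
print(st.st_ino)                 # the inode number
print(st.st_size, st.st_mtime)   # metadata stored in that inode
```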

The first consequence is that the inode cache becomes less efficient.
We would have to multiply the servers' RAM by 10 to keep the cache at its current level (we are talking about several hundred TB of RAM).
Since the cache is less efficient, the information has to be fetched from disk more often (and a hard disk is slow: several milliseconds per access). This increases the load on the disks and therefore degrades performance.
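As a rough back-of-the-envelope illustration of that scaling (the per-inode memory cost and file counts below are placeholder assumptions, not our real figures), the RAM needed to keep inodes hot grows linearly with the number of files, so a 10x jump in file count means a 10x jump in RAM for the same cache hit rate:

```python
# Back-of-the-envelope only: ~1KB per cached inode (xfs_inode + VFS
# inode + dentry overhead) is an assumed order of magnitude, and the
# file counts are placeholders, not Hubic's real numbers.

def inode_cache_ram_tb(num_files, bytes_per_cached_inode=1024):
    """RAM needed to keep num_files inodes in cache, in TB."""
    return num_files * bytes_per_cached_inode / 1024**4

for n_files in (10e9, 100e9):    # before / after a 10x file explosion
    print(f"{n_files:.0e} files -> ~{inode_cache_ram_tb(n_files):.0f} TB of RAM")
```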

The second consequence, less obvious, concerns stability. At this level of XFS usage, we run into many problems. From time to time, the Linux kernel stops sending requests to the file system: it becomes impossible to read from or write to the hard drive, and the data is no longer accessible until the server is restarted.
Another issue: under heavy load, the Linux kernel crashes. The machine freezes and has to be physically restarted (reset). Yet another issue: "spontaneous" file system corruption. Corruption sometimes appears for no apparent reason. Likewise, if a hard drive fills up completely, the file system gets corrupted and becomes inaccessible. Important note: it is not your data that gets corrupted, but the file system's structure.

This corruption is very annoying because, given the size of the inode table, repairing a file system becomes complicated. It takes between 12 and 24 hours and consumes around 80GB of RAM. We therefore cannot do it directly in production; the drive has to be moved to a machine dedicated to file system repairs, with enough RAM for the job.

From the discussions I have had with the Swift community, it appears that:
*- we would be the largest Swift erasure coding cluster they are aware of
*- we would be the only ones to encounter such problems with XFS

However, others have encountered problems that come close to ours, without reaching our extremes:
http://oss.sgi.com/archives/xfs/2016-01/msg00169.html
http://oss.sgi.com/archives/xfs/2016-02/msg00053.html

Now that the diagnosis has been made, we have to act. Patching the Linux kernel is ruled out: it requires very sharp kernel and file system development skills, and without knowing the magnitude of the work involved, there is no guarantee of seeing the end of it within a reasonable time. We therefore looked at several file systems as alternatives to XFS. Given Swift's constraints, the following file systems were tested:
*- XFS v4 and v5 (well, you need a reference point in a test)
*- ZFS
*- JFS
*- ReiserFS

The other most common file systems (ext2/3/4, btrfs, ...) are not compatible with Swift.
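For context, one of the constraints in question is that Swift stores object metadata in extended attributes, so any candidate file system has to handle xattrs of a reasonable size. A quick sanity check might look like this (a sketch assuming a Linux host; the mount path and size are examples, not our configuration):

```python
# Checks whether a mounted file system accepts a reasonably large
# extended attribute, which Swift needs for object metadata.
import os
import tempfile

def supports_large_xattrs(mount_point, size=4096):
    """Try to write a `size`-byte xattr on a temporary file under mount_point."""
    with tempfile.NamedTemporaryFile(dir=mount_point) as f:
        try:
            os.setxattr(f.name, "user.swift.test", b"x" * size)
            return True
        except OSError:
            return False

# Example (hypothetical mount point of one data disk):
# print(supports_large_xattrs("/srv/node/sdb1"))
```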

I will not go into the details of the tests here.

The conclusion is that no file system is perfect. One is simply less bad than the others: ReiserFS. Its performance is average but remains constant regardless of how full the disk is and how many files are stored. It is reliable, and it is natively integrated into the kernel.
(The least stable is JFS: after a hard reboot, we were never able to access the test data again!)

We will deploy one server (and only one for the moment) on ReiserFS to verify its real-life behavior and compare it with the other servers on XFS. Depending on the results, we will decide whether or not to continue. We are moving cautiously despite the current problems because, for a storage infrastructure, the file system is a bit like flour for a baker => it is the basic ingredient and it must never be missing.

Quite a few issues stem from this XFS instability; I will come back to them in a future email.

Also note that even if we definitively validate ReiserFS, a complete replacement of XFS is unlikely given the number of hard disks it would represent. XFS and ReiserFS will therefore coexist for a while. This means that we are also working on other areas of improvement.

Best regards.