One Billion Dollars! Wait… I Mean One Billion Files!!!

One Billion Dollars! Wait… I Mean One Billion Files!!!
Linux Magazine ^ | 6 October 2010 | Jeffrey B. Layton

Posted on 10/08/2010 8:06:52 AM PDT by ShadowAce

The world is awash in data. This fact is putting more and more pressure on file systems to efficiently scale to handle increasingly large amounts of data. Recently, Ric Wheeler from Redhat experimented with putting 1 Billion files in a single file system to understand what problems/issues the Linux community might face in the future. Let’s see what happened…

Awash in a sea of data

No one is going to argue that the amount of data we generate and want to keep is growing at an unprecedented rate. In a 2008 article blogger Dave Raffo highlighted some statistics from an IDC model of enterprise data growth rate, that unstructured data was increasing at about 61.7% CAGR (Compounded Annual Growth Rate). In addition, data in the cloud (Google, Facebook, etc.) was expected to increases at a rate of 91.8% through 2012. These are astonishing growth rates that are causing file system developers to either bite their finger nails to the quick or for them to start thinking about some fairly outlandish file system requirements.

As an example, on lwn, a poster mentioned that a single MRI instrument can produce 20,000 files in a single scan. In about 9 months they had already produced about 23 million files from a single MRI instrument.

Individual’s are taking digital picture with just about everything they own with cell phone images being the most popular. These images get uploaded to desktop and laptops, and, hopefully, end up on backup drives. Many of these images are also uploaded to Facebook or flickr or even personal websites. I know of a friend’s daughter who just started college and already has over 15,000 pictures of which a majority are on Facebook. With a family of 4, each taking 5,000-10,000 pictures a year, you can easily generate 20,000-40,000 files per year. Then you throw in email, games, papers, Christmas cards, music and other sources of data, a family can easily generate 1 million files a year on a family desktop or NAS server.

So far we’ve been able to store this much data because 2TB drives are very common, and 3TB drives are right around the corner. These can be used to create storage arrays that easily hit the 100TB mark with relatively little financial strain. Plus we can buy 2-10 of these drives during sales at Fry’s and stuff them in a nice case to give us anywhere from 2TB to 20TB of space just in our home desktop.

There are huge questions surrounding all of this data and it’s storage. How can we search this data? How can we ensure that the data doesn’t become corrupted? (sometimes that means making multiple copies so our storage requirements just doubled). How do we move data to/from our laptops, cell phones, desktops, to a more centrally controlled location? How do we backup all of this data? But perhaps one of the more fundamental questions is, can our storage devices, specifically our file systems, store this much data and still be able to function?

Smaller Scale Testing

Recently, Ric Wheeler from Redhat started doing some experiments with file systems to understand their limitations in regard to scale. In particular, he wanted to try loading up Linux file systems with 1 billion files as an experiment and see what happened.

As Ric pointed out in a presentation he made at 2010 LinuxCon, 1 billion files is very conceivable from a capacity perspective. If you use 1KB files, then 1 billion files (1,000,000,000) takes up only 1TB. If you use 10KB files, then you need 10TB’s to accommodate 1 billion files (not to difficult to imagine even in a home system). If you 100KB files, then you need 100TB’s to hold 1 billion files. Again it’s not hard to imagine 100 TB’s in a storage array. The point is that with smaller files, current storage devices can easily accommodate 1 billion files from a capacity perspective.

Ric built a 100TB storage array (raw capacity) for performing some tests. But, as previously mentioned, you don’t need much capacity for performing these experiments. According to Ric the life cycle of a file system has several stages

Create the file system (mkfs)
Fill the file system
Iteration over the files (basically use it)
Repair the file systems (fsck)
Removing files

Ric create 1 million file file systems and experimented with each of these steps to understand how they performed. He examined several file systems - ext3, ext4, xfs, and btrfs - for each of these stages and recorded the results.

To understand the amount of time it takes to create file systems, Ric performed a simple experiment by first creating a 1TB file system using the four file systems on a SATA disk. To understand if the bottleneck was the performance of the storage media itself, he also built a 75GB file system on a PCIe SSD. Figure 1 below is from his LinuxCon talk plotting the amount of time it took for each file system to be created.

Figure 1: File System make (mkfs) for four file systems and two hardware devices

Remember that he’s focusing on 1 million file file systems for these experiments. Notice that ext3 and ext4 took a long time to create because of the need to create static inode tables. XFS, with dynamic inode allocation, is much faster only taking about 20 seconds (ext3 took approximately 275 seconds). On the other hand, creating the file systems on the PCIe based SSD was considerably faster relative to the SATA drive. However, it is still noticeable that ext3 and ext4 took longer than the XFS and btrfs to create a file system.

The second phase of the file system’s life cycle is to fill the file system with data. Recall that Ric is using 1 million files to fill up the file systems. He used 1,000 directories with 1,000 file each. Figure 2 below shows the amount of time it took to create 1 million files on the two types of hardware - 1TB file system on a SATA drive and a 75GB file system on a PCIe SSD.

Figure 2: File System file create (fill file system) for four file systems and two hardware devices

You can see in the figure that ext3 and XFS took the longest time to fill the file system on the SATA drive. It took ext3 about 9,700 seconds or 2 hours and 42 minutes, XFS took about 7,200 seconds or right around 2 hours, ext4 took about 800 seconds (a little over 13 minutes), and btrfs took about 600 seconds (10 minutes). On the PCIe SSD, while difficult to tell in the figure, XFS took slightly longer than ext3, ext4, or btrfs. Ext4 was the fastest file system to fill up on the SSD, but all four file systems take so little time that it’s difficult to differentiate between them.

The fourth phase in the life cycle is to repair the file system. Figure 3 below plots the amount of time it takes to repair the 1 million file configurations.

Figure 3: File system check/repair (fsck) for four file systems and two hardware devices

It’s pretty obvious that ext3 is much slower than the other file systems on both storage media, but it’s really noticeable on the SATA drive. It took about 1,040 seconds to repair the file system while btrfs, which had the second worse time, took only about 90 seconds. However, notice how fast the file system was repaired on the PCIe SSD. Even ext3 took only about 80 seconds to repair 1 million files.

The final life cycle phase is the remove files from the file systems. Figure 4 below plots the time it took to remove all 1 million files from each file system for both storage media.

Figure 4: File remove for four file systems and two hardware devices

Notice that XFS is much slower than even ext3 on the SATA drive. It took XFS about 3,800 seconds (a little over an hour) to remove all 1 million files. The next slowest file system was ext3 and it took about 875 seconds (not quite 15 minutes) to remove the files. Ext4 was the fastest on the SATA drive but btrfs wasn’t too far behind. On the other hand, on the PCIe SSD, the slowest was btrfs followed by ext3, XFS, and then ext4. But the differences in time is very small.

While the previous tests were for 1 million files, they did point out some glaring differences in the various file systems. But Ric wanted to really push the envelope so he built a very large storage array of almost 100 TB’s using 2TB SATA drives and drive arrays. He formatted the file system with ext4 and then ran about the same tests as he did for the 1 million files but used 20KB files (1 billion of them). Here’s a quick summary of what he found.

Make the file system (mkfs)
- Approximately 4 hours
Fill the file system (1 billion files of about 20KB each)
- Approximately 4 days
Run a file system check (fsck) with 1 billion files
- 2.5 hours
XFS still has problems with meta-data intensive workloads.
Faster storage can be helpful. In particular, Ric mentioned that btrfs can use SSD’s for metadata and leave the bulk data on the slower storage.
- I hope everyone has also read my article series about a patch that allows you to use SSD’s to cache block devices (storage devices). There are other options for using SSD’s for caching including flashcache and one patch set that has caught the eye of developers gave btrfs the ability to take the “temperature” of data so that it could be moved to a faster or slower device as needed (the movement aspect was not in the proposed patch set).

Ric underscores that the rates are consistent for zero length files and small files (i.e. making the file really small didn’t help the overall performance rates).

The absolute best thing from this testing is that current file systems can handle 1 billion files. They may not be as fast as you want on SATA drives but they did function and you could actually fsck the file system (if you need more speed you can always flip for some MLC based SSD’s).

Ric talked about some specific lessons they learned from the testing:

When he fsck-ed the 1 billion file ext4 file system (a total of 70TB capacity), it took about 10GB of memory during the operation. This may sound like a great deal of memory and on today’s laptops and desktops this may be true, but on servers this amount of memory is fairly common.
Using xfs_repair (the XFS file system repair tool), on a large file system took almost 30GB of memory which is quite a bit of memory even for servers.

Ric also mentioned that with a file system with 1 billion files, running an “ls” command is perhaps not a good idea. The reason is that ls uses both the readdir and stat system functions which means that all of the metadata has to be touched twice. The result is that it takes a great deal of time to perform the “ls” operation. He did point out a way to reduce the time but performing an “ls” is still not a fast operation. Moreover, he did point out that the performance of file enumeration, which is what “ls” does to some degree, proceeds at the rate of file creation so it could take quite a while to perform the “ls” command.

There have been proposal for improving the responsiveness of “ls” for large file systems but nothing has been adopted universally. One proposal is to do what is termed “lazy updates” on metadata. In this concept, a summary of the metadata is kept within the file system so that when a user performs an “ls” operation, the summary is quickly read and the results are given to the user. However, the “lazy” part of the name implies that the summary data may not be absolutely accurate. It may not have the file sizes absolutely correct or it may not have the file that was created a microsecond prior to the “ls”. But the point of lazy updates is to allow users to get an idea of the status of their file system. Of course, “ls” could have an option such as “-accurate” that tells the command to use readdir and stat to get the most accurate state of the file system possible.

However, even this “accurate” option may not get you the most accurate information. Because there are so many files, by the time the last files have been accessed the status of the first files may have changed. To get “hyper-accurate” values for large file systems, you need to freeze the file system, perform the operation, and then continue to use the file system as normal. I’m not sure how many people would be willing to do this. But the problem is that I’ve seen people us the “ls” command as part of an application script. Not the brightest idea in the world in my opinion, but it allowed them to compute what they wanted.

Finally Ric underscored two other gotchas with extremely large file systems. The first one is the remote replication of backup to tape is a very long process. This is because, again, enumeration and the read rate of the file systems drop in terms of performance while other file system operations happen concurrently. This increases the length of time to perform an already long series of operations.

The second thing Ric highlighted was that operations such as backup, that take a long time, could be prone to failures (Ric actually used the words “will fail”). Consequently, for some of the these operations we will need to develop a checkpoint/restart capability and even do only a minimal number of IO retries when hitting a bad sector (currently many file systems will try several times with an increasing amount of time between retries - this increases the amount of time the file system will try to read a bad sector, holding up the whole enumeration/read processes).

Summary

While it sounds fairly simple, and conceptually it is, Ric’s file system experiments really highlight some limitations in our current file systems. With the increasing pressure of massive amounts of data, having file systems that can scale to billions of files is going to become a requirement (not a “nice-to-have”).

Ric’s simple tests on file systems with 1 million files can easily be done by anyone using small enough files. But these experiments really draws attention to the differences in the file systems as well as sheer amount of time it can take to performance file system tasks. But the really good news is that it is definitely possible to function with file systems with 1 million files if we are a little more patient around the length of time to complete certain operations.

Ric’s 1 billion file experiment was the really cool final experiment and we learned a great deal from it. Firstly, we learned that we can actually build and run 1 billion files in our current file systems. They may be slower but they can function. However, the time it takes to performance certain functions have underscored the differences in the file systems. But just as any experiment that pushes the edges of the proverbial envelope, we learned some new things:

Faster storage hardware really helps (perhaps we need to find a way to get SSD’s to be more effectively used in Linux and/or our current file systems
Doing an “ls” command on a 1 billion file file system is not the best idea. It looks as though we need to rethink how to make this function and others more efficient on really large file systems.
Performing backups or remote replication is going to be a very long process. This means that we have to be ready for failures during the processes pointing out the need for some sort of checkpoint/restart capability.

I want to thank Ric for his great presentation and permission to use the images.

TOPICS: Computers/Internet
KEYWORDS: filesystems; linux

Navigation: use the links below to view more comments.
first previous 1-20, 21-24 last

To: antiRepublicrat

> Never underestimate the bandwidth of a station wagon full of backup tape

I had occasion to recall that old saw a couple months ago, moving a lot of data (though nowhere near 23TB) from one center to another, a couple of miles apart. We've got gig-E-over-fiber between the centers, but because of other traffic on the line, we only saw about 20MB/sec and the transfer time was estimated at the better part of a day. It had to be done quicker than that.

I pulled half of the RAID mirror array, threw it (well, not actually "threw") in the backseat of my car and drove it to the other center, hooked it up, copied, drove back, plugged the drive back in the RAID to rebuild, in half the time it would have taken to transfer over the fiber.

One of -those- days. :)

21 posted on 10/08/2010 10:12:15 PM PDT by dayglored (Listen, strange women lying in ponds distributing swords is no basis for a system of government!)

[ Post Reply | Private Reply | To 18 | View Replies]

To: sionnsar

IDE or SATA? I had some serious issues with kubuntu a while back. It was a SATA hard disk, 10K RPM, and I could get through the whole installation and configuration part of the install, even get to the KDE desktop, but as soon as I started installing security updates, it crashed. Every single time, randomly during the install process.

Try running ext4 on SSD. I’ve never been so comfortable about a hard disk as I was with my 100 GB SSD. It ran smooth as silk, no data access issues, quiet, and I consistently got 1+ GB transfer rates (between 2 SSDs).

For the record, I would not recommend ext4 in enterprise environments. Linux runs amazingly well on a ProLiant DL360, but I can swear that ext3 ran better on a RAID5 disk array.

22 posted on 10/09/2010 3:34:47 PM PDT by rarestia (It's time to water the Tree of Liberty.)

[ Post Reply | Private Reply | To 19 | View Replies]

To: rarestia

This is old equipment: IDE. I don't think we have any SATA drives in the house workstations.

I don't remember the details of the stability problem, other than having to reinstall Xubuntu 9.10 several times before I gave up and went to 9.04, which has been quite solid. Was sure glad I had /home in its own partition!

I used Kubuntu for several years, but it became progressively more sluggish with newer releases. Xubuntu 9.04 is not exactly snappy either on this hardware, but responds quickly enough.

23 posted on 10/09/2010 4:27:35 PM PDT by sionnsar (IranAzadi|5yst3m 0wn3d-it's N0t Y0ur5:SONY|TV--it's NOT news you can trust)

[ Post Reply | Private Reply | To 22 | View Replies]

To: zeugma

Desktop users, especially those that have never figured out how to properly organize files can quickly get into a state where they can't find files they are looking for, even if they know it's somewhere on the computer. Your average user doesn't know anything about how to efficiently organize directories and subdirectories. Heck, I sometimes have problems with it myself, and find myself going back through and reorganizing things periodically just to keep things straight.

Ha. Thought I had become pretty good about that, even filing my travel-related documents (flight bookings, hotel reservations, conference registrations, etc.) in directories specifically devoted to those trips (the Powerpoint tree for speaking engagements, for example) -- but yesterday found myself searching for a 13-year-old document that defied such a system. It turned out my manual search got me close, but it was in a subfolder whose name did not trigger an association with what I was searching for.

24 posted on 10/09/2010 4:52:00 PM PDT by sionnsar (IranAzadi|5yst3m 0wn3d-it's N0t Y0ur5:SONY|TV--it's NOT news you can trust)

[ Post Reply | Private Reply | To 20 | View Replies]

Navigation: use the links below to view more comments.
first previous 1-20, 21-24 last

Disclaimer: Opinions posted on Free Republic are those of the individual posters and do not necessarily represent the opinion of Free Republic or its management. All materials posted herein are protected by copyright law and the exemption for fair use of copyrighted works.

Free Republic
Browse · Search

General/Chat
Topics · Post Article

FreeRepublic, LLC, PO BOX 9771, FRESNO, CA 93794