Posted on 10/08/2010 8:06:52 AM PDT by ShadowAce
The world is awash in data. This fact is putting more and more pressure on file systems to scale efficiently to handle increasingly large amounts of data. Recently, Ric Wheeler from Red Hat experimented with putting 1 billion files in a single file system to understand what problems and issues the Linux community might face in the future. Let's see what happened.
Awash in a sea of data
No one is going to argue that the amount of data we generate and want to keep is growing at an unprecedented rate. In a 2008 article, blogger Dave Raffo highlighted statistics from an IDC model of enterprise data growth: unstructured data was increasing at about a 61.7% CAGR (Compound Annual Growth Rate). In addition, data in the cloud (Google, Facebook, etc.) was expected to increase at a rate of 91.8% through 2012. These astonishing growth rates are causing file system developers either to bite their fingernails to the quick or to start thinking about some fairly outlandish file system requirements.
As an example, on LWN, a poster mentioned that a single MRI instrument can produce 20,000 files in a single scan. In about 9 months, that one instrument had already produced about 23 million files.
Individuals are taking digital pictures with just about everything they own, with cell phone cameras being the most popular. These images get uploaded to desktops and laptops and, hopefully, end up on backup drives. Many of these images are also uploaded to Facebook, flickr, or even personal websites. I know of a friend's daughter who just started college and already has over 15,000 pictures, a majority of them on Facebook. With a family of 4, each taking 5,000-10,000 pictures a year, you can easily generate 20,000-40,000 files per year. Throw in email, games, papers, Christmas cards, music, and other sources of data, and a family can easily generate 1 million files a year on a family desktop or NAS server.
So far we've been able to store this much data because 2TB drives are very common, and 3TB drives are right around the corner. These can be used to create storage arrays that easily hit the 100TB mark with relatively little financial strain. Plus, we can buy 2-10 of these drives during sales at Fry's and stuff them in a nice case to give us anywhere from 2TB to 20TB of space in our home desktop.
There are huge questions surrounding all of this data and its storage. How can we search this data? How can we ensure that the data doesn't become corrupted? (Sometimes that means making multiple copies, so our storage requirements just doubled.) How do we move data from our laptops, cell phones, and desktops to a more centrally controlled location? How do we back up all of this data? But perhaps one of the more fundamental questions is: can our storage devices, specifically our file systems, store this much data and still be able to function?
Smaller Scale Testing
Recently, Ric Wheeler from Red Hat started doing some experiments with file systems to understand their limitations with regard to scale. In particular, he wanted to try loading up Linux file systems with 1 billion files and see what happened.
As Ric pointed out in a presentation he gave at LinuxCon 2010, 1 billion files is very conceivable from a capacity perspective. If you use 1KB files, then 1 billion files (1,000,000,000) take up only 1TB. If you use 10KB files, then you need 10TB to accommodate 1 billion files (not too difficult to imagine even in a home system). If you use 100KB files, then you need 100TB to hold 1 billion files. Again, it's not hard to imagine 100TB in a storage array. The point is that with smaller files, current storage devices can easily accommodate 1 billion files from a capacity perspective.
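The capacity arithmetic above is easy to verify as a back-of-envelope calculation (a minimal sketch using decimal units, so 1TB = 1,000,000,000KB):

```shell
# Total space needed for 1 billion files at various per-file sizes.
# Decimal (SI) units throughout: 1 TB = 10^9 KB.
FILES=1000000000
for SIZE_KB in 1 10 100; do
    TB=$((FILES * SIZE_KB / 1000000000))
    echo "${SIZE_KB}KB files -> ${TB}TB"
done
```

Running this prints 1TB, 10TB, and 100TB for the three file sizes, matching the figures in the talk.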
Ric built a 100TB storage array (raw capacity) for performing some tests. But, as previously mentioned, you don't need much capacity for these experiments. According to Ric, the life cycle of a file system has several stages; his tests step through creating the file system (mkfs), filling it with files, checking and repairing it (fsck), and finally removing the files.
To understand the amount of time it takes to create file systems, Ric performed a simple experiment by first creating a 1TB file system with each of the four file systems on a SATA disk. To understand whether the bottleneck was the performance of the storage media itself, he also built a 75GB file system on a PCIe SSD. Figure 1 below, from his LinuxCon talk, plots the amount of time it took for each file system to be created.
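You can try the mkfs and fsck phases yourself at small scale without root or a spare disk, because the ext4 tools accept a regular file as a target. This is only a sketch: it assumes e2fsprogs (`mkfs.ext4`, `e2fsck`) is installed, and the image size is scaled far down from Ric's 1TB test:

```shell
# Create a sparse image file and run the create and check/repair
# phases of the file system life cycle against it.
truncate -s 256M fs.img        # sparse: occupies almost no real space
mkfs.ext4 -F -q fs.img         # create phase (mkfs); -F allows a plain file
e2fsck -f -p fs.img            # check/repair phase (fsck); -f forces a full check
```

Prefixing each command with `time` reproduces the kind of measurement shown in Figures 1 and 3, just on a toy image.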
Figure 1: File System make (mkfs) for four file systems and two hardware devices
The second phase of the file system's life cycle is to fill the file system with data. Recall that Ric used 1 million files for the smaller-scale tests, arranged as 1,000 directories with 1,000 files each. Figure 2 below shows the amount of time it took to create 1 million files on the two types of hardware - a 1TB file system on a SATA drive and a 75GB file system on a PCIe SSD.
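The fill phase is straightforward to sketch. The following is a scaled-down imitation of Ric's 1,000 x 1,000 layout (the directory name and the 10 x 10 counts here are made up for illustration; bump the two variables to reproduce the real workload):

```shell
# Fill phase: DIRS directories of FILES_PER small files each.
ROOT=./filltest
DIRS=10
FILES_PER=10
mkdir -p "$ROOT"
i=0
while [ "$i" -lt "$DIRS" ]; do
    mkdir -p "$ROOT/dir$i"
    j=0
    while [ "$j" -lt "$FILES_PER" ]; do
        head -c 1024 /dev/zero > "$ROOT/dir$i/file$j"   # 1KB per file
        j=$((j + 1))
    done
    i=$((i + 1))
done
echo "created $((DIRS * FILES_PER)) files"
```

Wrapping the loop in `time` gives a crude version of the measurement in Figure 2.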
The fourth phase in the life cycle is to repair the file system. Figure 3 below plots the amount of time it takes to repair the 1 million file configurations.
Figure 3: File system check/repair (fsck) for four file systems and two hardware devices
The final life cycle phase is to remove files from the file system. Figure 4 below plots the time it took to remove all 1 million files from each file system on both storage media.
While the previous tests used only 1 million files, they pointed out some glaring differences among the various file systems. But Ric wanted to really push the envelope, so he built a very large storage array of almost 100TB using 2TB SATA drives and drive arrays. He formatted the file system with ext4 and then ran roughly the same tests as he did for 1 million files, but with 1 billion 20KB files. Here's a quick summary of what he found.
The absolute best news from this testing is that current file systems can handle 1 billion files. They may not be as fast as you want on SATA drives, but they did function, and you could actually fsck the file system (if you need more speed, you can always spring for some MLC-based SSDs).
Ric talked about some specific lessons they learned from the testing:
Ric also mentioned that on a file system with 1 billion files, running an ls command is perhaps not a good idea. The reason is that ls uses both the readdir and stat system calls, which means that all of the metadata has to be touched twice. The result is that the ls operation takes a great deal of time. He pointed out a way to reduce the time, but performing an ls is still not a fast operation. Moreover, he noted that file enumeration, which is what ls does to some degree, proceeds at roughly the rate of file creation, so it could take quite a while to complete.
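The readdir-versus-stat distinction is easy to see from the shell. A minimal sketch (the directory name and file count are made up; with millions of entries the timing gap becomes dramatic):

```shell
# Populate a test directory.
mkdir -p ./bigdir
i=0
while [ "$i" -lt 50 ]; do touch "./bigdir/f$i"; i=$((i + 1)); done

# 'ls -f' returns entries in raw readdir order: no sorting and no
# per-entry stat, so it only touches the directory itself.
ls -f ./bigdir > /dev/null

# 'ls -l' additionally calls stat() on every entry to get sizes,
# owners, and timestamps -- the metadata double-touch Ric describes.
ls -l ./bigdir > /dev/null
```

On a directory with millions of files, the first form stays responsive while the second can run for a very long time.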
There have been proposals for improving the responsiveness of ls on large file systems, but nothing has been adopted universally. One proposal is to do what are termed lazy updates of metadata. In this approach, a summary of the metadata is kept within the file system so that when a user performs an ls operation, the summary is quickly read and the results are given to the user. However, the lazy part of the name implies that the summary data may not be absolutely accurate. It may not have the file sizes exactly right, or it may be missing a file that was created a microsecond before the ls. But the point of lazy updates is to let users get an idea of the status of their file system. Of course, ls could have an option such as -accurate that tells the command to use readdir and stat to get the most accurate state of the file system possible.
However, even this accurate option may not get you the most accurate information. Because there are so many files, by the time the last files have been accessed, the status of the first files may have changed. To get hyper-accurate values for large file systems, you would need to freeze the file system, perform the operation, and then resume using the file system as normal. I'm not sure how many people would be willing to do this. But the problem is that I've seen people use the ls command as part of an application script. Not the brightest idea in the world in my opinion, but it allowed them to compute what they wanted.
Finally, Ric underscored two other gotchas with extremely large file systems. The first is that remote replication or backup to tape is a very long process. This is because, again, the enumeration and read rates of the file system drop while other file system operations happen concurrently. This increases the length of time needed to perform an already long series of operations.
The second thing Ric highlighted was that long-running operations such as backup could be prone to failures (Ric actually used the words will fail). Consequently, for some of these operations we will need to develop a checkpoint/restart capability and perform only a minimal number of IO retries when hitting a bad sector (currently many file systems will retry several times with an increasing amount of time between retries - this increases the time the file system spends trying to read a bad sector, holding up the whole enumeration/read process).
Summary
While it sounds fairly simple, and conceptually it is, Ric's file system experiments really highlight some limitations in our current file systems. With the increasing pressure of massive amounts of data, having file systems that can scale to billions of files is going to become a requirement (not a nice-to-have).
Ric's simple tests on file systems with 1 million files can easily be reproduced by anyone using small enough files. But these experiments really draw attention to the differences among the file systems, as well as to the sheer amount of time it can take to perform file system tasks. The really good news is that it is definitely possible to function with file systems holding 1 million files if we are a little more patient about the time certain operations take.
Ric's 1 billion file experiment was the really cool final experiment, and we learned a great deal from it. First, we learned that we can actually create and use 1 billion files on our current file systems. They may be slower, but they can function. However, the time it takes to perform certain operations underscored the differences among the file systems. And just as with any experiment that pushes the edge of the proverbial envelope, we learned some new things.
I want to thank Ric for his great presentation and permission to use the images.
Hmm. This guy should give me a call. We were putting 100+ million records on Linux Slackware systems using raw devices in 1995. Have since invented something that will easily hold billions of records that can span multiple machines across any network.
I use ext4 on all of my Linux distros, if possible. It seems to me that SSDs are the next phase of the evolution of storage. Having recently purchased my first 1 TB SATA disk for my new gaming rig, I decided to run Windows 7’s system experience test on my system. I scored 7.6, 7.6, 7.7, and 7.7 respectively on each of the tests (CPU, memory, DirectX, and video), but my hard disk dumped my score down to a 5.9 (MS uses the lowest score as the final). I was shocked, to say the least, but my previous system had a 150 GB SATA disk at 10K RPM rotational speed and netted me a 6.5 on the same test.
Size truly does matter, but interface bandwidth and operating system disk operations appear to be the primary concerns.
I assume you mean a database. Most databases have a relatively small number of files on the filesystem, but the files may include millions or billions of records. That is, record =/= file.
So, are you storing each record in a separate file on the filesystem? If not, you're comparing apples and oranges.
Why wouldn’t you use a database?
http://en.wikipedia.org/wiki/ZFS
I thought that Apple almost put this into 10.5 (maybe), but had some license issues...Not really sure, but it is good to know that folks are in front of this.
It's not necessarily about storing photos or databases.
Why a database? Because in my experience trying to manage a high volume of individual data files is extremely difficult.
By manage I mean keep track of, update, backup, control access to and in general ensure the integrity of the data.
Regardless of whether you’re on an array, Windows with NTFS starts dying with only 20,000 or so files in a single folder. You’re sure to get a lock-up with half a million files.
Yes it can. However, managing thousands of users on a single filesystem with their /home directories in a database is not very feasible, is it?
See Post #5
On the flip side, managing terabyte-sized files is even more of a pain in my experience.
“managing terabyte-sized files is even more of a pain “
Dunno, the largest I’ve dealt with was 23TB. It made restoring from the DR site a bit challenging due to bandwidth limitations but all in all it wasn’t that bad.
You can split a database into multiple files on most systems. It makes dealing with them a little easier, and usually improves performance.
This was an oracle DB hosted on SAN. We used BCVs and all that. The database itself wasn’t a problem. It was moving 23 TB between datacenters that created an issue.
Never underestimate the bandwidth of a station wagon full of backup tapes.
Hm. I tried ext4 but found it unreliable. Then again, it might have been when I migrated to Xubuntu 9.10 which, installed (once) on this system was so flaky I abandoned it for Xubuntu 9.04 and ext3. Stable.
Some flaky hardware? I dunno. Once in a while I have to pull the power plug for a few seconds and re-insert before getting the old box to boot.
Hmmm... how many files do I have on my computer...
$ sudo find / -print | wc -l
605730
That's a lot of files IMO for a simple desktop computer, but nowhere near what they are talking about.
I've dealt with directories at work with 100k+ files in them (as a result of really stupid programmers in this case), and it's not pretty when you need to do cleanup there. Thank God for xargs!
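For anyone facing the same cleanup, the xargs trick the commenter is thanking God for looks roughly like this (a minimal sketch; the directory name and file count are made up). The point is that `rm ./junk/*` expands every name onto one command line and can exceed the kernel's ARG_MAX limit with 100k+ files, while `find | xargs` batches the names into safely sized invocations:

```shell
# Populate a throwaway directory to clean up.
mkdir -p ./junk
i=0
while [ "$i" -lt 100 ]; do touch "./junk/tmp$i"; i=$((i + 1)); done

# -print0/-0 delimit names with NUL bytes, so filenames containing
# spaces or newlines are handled correctly; xargs batches the rm calls.
find ./junk -type f -print0 | xargs -0 rm -f
```

Modern `find` can also do this directly with `-delete`, but the xargs form generalizes to any per-file command.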
File proliferation is actually a serious issue. Desktop users, especially those that have never figured out how to properly organize files can quickly get into a state where they can't find files they are looking for, even if they know it's somewhere on the computer. Your average user doesn't know anything about how to efficiently organize directories and subdirectories. Heck, I sometimes have problems with it myself, and find myself going back through and reorganizing things periodically just to keep things straight.