Posted on 10/08/2010 8:06:52 AM PDT by ShadowAce
The world is awash in data. This fact is putting more and more pressure on file systems to scale efficiently to handle increasingly large amounts of data. Recently, Ric Wheeler from Red Hat experimented with putting 1 billion files in a single file system to understand what problems and issues the Linux community might face in the future. Let's see what happened.
Awash in a sea of data
No one is going to argue that the amount of data we generate and want to keep is growing at an unprecedented rate. In a 2008 article, blogger Dave Raffo highlighted statistics from an IDC model of enterprise data growth: unstructured data was increasing at about a 61.7% CAGR (Compound Annual Growth Rate). In addition, data in the cloud (Google, Facebook, etc.) was expected to increase at a rate of 91.8% through 2012. These astonishing growth rates are causing file system developers either to bite their fingernails to the quick or to start thinking about some fairly outlandish file system requirements.
As an example, on LWN, a poster mentioned that a single MRI instrument can produce 20,000 files in a single scan. In about 9 months, that one instrument had already produced about 23 million files.
Individuals are taking digital pictures with just about everything they own, with cell phone cameras being the most popular. These images get uploaded to desktops and laptops and, hopefully, end up on backup drives. Many of these images are also uploaded to Facebook, flickr, or even personal websites. I know of a friend's daughter who just started college and already has over 15,000 pictures, a majority of them on Facebook. With a family of 4, each taking 5,000-10,000 pictures a year, you can easily generate 20,000-40,000 files per year. Throw in email, games, papers, Christmas cards, music, and other sources of data, and a family can easily generate 1 million files a year on a family desktop or NAS server.
So far we've been able to store this much data because 2TB drives are very common, and 3TB drives are right around the corner. These can be used to create storage arrays that easily hit the 100TB mark with relatively little financial strain. Plus, we can buy 2-10 of these drives during sales at Fry's and stuff them in a nice case to give us anywhere from 2TB to 20TB of space in our home desktop.
There are huge questions surrounding all of this data and its storage. How can we search this data? How can we ensure that the data doesn't become corrupted? (Sometimes that means making multiple copies, so our storage requirements just doubled.) How do we move data from our laptops, cell phones, and desktops to a more centrally controlled location? How do we back up all of this data? But perhaps one of the more fundamental questions is: can our storage devices, specifically our file systems, store this much data and still be able to function?
Smaller Scale Testing
Recently, Ric Wheeler from Red Hat started doing some experiments with file systems to understand their limitations with regard to scale. In particular, he wanted to try loading up Linux file systems with 1 billion files and see what happened.
As Ric pointed out in a presentation he gave at LinuxCon 2010, 1 billion files is very conceivable from a capacity perspective. If you use 1KB files, then 1 billion files (1,000,000,000) take up only 1TB. If you use 10KB files, then you need 10TB to accommodate 1 billion files (not too difficult to imagine even in a home system). If you use 100KB files, then you need 100TB to hold 1 billion files. Again, it's not hard to imagine 100TB in a storage array. The point is that with smaller files, current storage devices can easily accommodate 1 billion files from a capacity perspective.
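The capacity arithmetic above is easy to verify as a back-of-envelope calculation (a minimal sketch using decimal units, so 1TB = 1,000,000,000KB):

```shell
# Total space needed for 1 billion files at various per-file sizes.
# Decimal (SI) units throughout: 1 TB = 10^9 KB.
FILES=1000000000
for SIZE_KB in 1 10 100; do
    TB=$((FILES * SIZE_KB / 1000000000))
    echo "${SIZE_KB}KB files -> ${TB}TB"
done
```

Running this prints 1TB, 10TB, and 100TB for the three file sizes, matching the figures in the talk.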
Ric built a 100TB storage array (raw capacity) for performing some tests. But, as previously mentioned, you don't need much capacity for these experiments. According to Ric, the life cycle of a file system has several stages; his tests step through creating the file system (mkfs), filling it with files, checking and repairing it (fsck), and finally removing the files.
To understand the amount of time it takes to create file systems, Ric performed a simple experiment by first creating a 1TB file system with each of the four file systems on a SATA disk. To understand whether the bottleneck was the performance of the storage media itself, he also built a 75GB file system on a PCIe SSD. Figure 1 below, from his LinuxCon talk, plots the amount of time it took for each file system to be created.
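You can try the mkfs and fsck phases yourself at small scale without root or a spare disk, because the ext4 tools accept a regular file as a target. This is only a sketch: it assumes e2fsprogs (`mkfs.ext4`, `e2fsck`) is installed, and the image size is scaled far down from Ric's 1TB test:

```shell
# Create a sparse image file and run the create and check/repair
# phases of the file system life cycle against it.
truncate -s 256M fs.img        # sparse: occupies almost no real space
mkfs.ext4 -F -q fs.img         # create phase (mkfs); -F allows a plain file
e2fsck -f -p fs.img            # check/repair phase (fsck); -f forces a full check
```

Prefixing each command with `time` reproduces the kind of measurement shown in Figures 1 and 3, just on a toy image.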
Figure 1: File System make (mkfs) for four file systems and two hardware devices
The second phase of the file system's life cycle is to fill the file system with data. Recall that Ric used 1 million files for the smaller-scale tests, arranged as 1,000 directories with 1,000 files each. Figure 2 below shows the amount of time it took to create 1 million files on the two types of hardware - a 1TB file system on a SATA drive and a 75GB file system on a PCIe SSD.
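The fill phase is straightforward to sketch. The following is a scaled-down imitation of Ric's 1,000 x 1,000 layout (the directory name and the 10 x 10 counts here are made up for illustration; bump the two variables to reproduce the real workload):

```shell
# Fill phase: DIRS directories of FILES_PER small files each.
ROOT=./filltest
DIRS=10
FILES_PER=10
mkdir -p "$ROOT"
i=0
while [ "$i" -lt "$DIRS" ]; do
    mkdir -p "$ROOT/dir$i"
    j=0
    while [ "$j" -lt "$FILES_PER" ]; do
        head -c 1024 /dev/zero > "$ROOT/dir$i/file$j"   # 1KB per file
        j=$((j + 1))
    done
    i=$((i + 1))
done
echo "created $((DIRS * FILES_PER)) files"
```

Wrapping the loop in `time` gives a crude version of the measurement in Figure 2.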
The fourth phase in the life cycle is to repair the file system. Figure 3 below plots the amount of time it takes to repair the 1 million file configurations.
Figure 3: File system check/repair (fsck) for four file systems and two hardware devices
The final life cycle phase is to remove files from the file system. Figure 4 below plots the time it took to remove all 1 million files from each file system on both storage media.
While the previous tests used only 1 million files, they pointed out some glaring differences among the various file systems. But Ric wanted to really push the envelope, so he built a very large storage array of almost 100TB using 2TB SATA drives and drive arrays. He formatted the file system with ext4 and then ran roughly the same tests as he did for 1 million files, but with 1 billion 20KB files. Here's a quick summary of what he found.
The absolute best news from this testing is that current file systems can handle 1 billion files. They may not be as fast as you want on SATA drives, but they did function, and you could actually fsck the file system (if you need more speed, you can always spring for some MLC-based SSDs).
Ric talked about some specific lessons they learned from the testing:
Ric also mentioned that on a file system with 1 billion files, running an ls command is perhaps not a good idea. The reason is that ls uses both the readdir and stat system calls, which means that all of the metadata has to be touched twice. The result is that the ls operation takes a great deal of time. He pointed out a way to reduce the time, but performing an ls is still not a fast operation. Moreover, he noted that file enumeration, which is what ls does to some degree, proceeds at roughly the rate of file creation, so it could take quite a while to complete.
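The readdir-versus-stat distinction is easy to see from the shell. A minimal sketch (the directory name and file count are made up; with millions of entries the timing gap becomes dramatic):

```shell
# Populate a test directory.
mkdir -p ./bigdir
i=0
while [ "$i" -lt 50 ]; do touch "./bigdir/f$i"; i=$((i + 1)); done

# 'ls -f' returns entries in raw readdir order: no sorting and no
# per-entry stat, so it only touches the directory itself.
ls -f ./bigdir > /dev/null

# 'ls -l' additionally calls stat() on every entry to get sizes,
# owners, and timestamps -- the metadata double-touch Ric describes.
ls -l ./bigdir > /dev/null
```

On a directory with millions of files, the first form stays responsive while the second can run for a very long time.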
There have been proposals for improving the responsiveness of ls on large file systems, but nothing has been adopted universally. One proposal is to do what are termed lazy updates of metadata. In this approach, a summary of the metadata is kept within the file system so that when a user performs an ls operation, the summary is quickly read and the results are given to the user. However, the lazy part of the name implies that the summary data may not be absolutely accurate. It may not have the file sizes exactly right, or it may be missing a file that was created a microsecond before the ls. But the point of lazy updates is to let users get an idea of the status of their file system. Of course, ls could have an option such as -accurate that tells the command to use readdir and stat to get the most accurate state of the file system possible.
However, even this accurate option may not get you the most accurate information. Because there are so many files, by the time the last files have been accessed, the status of the first files may have changed. To get hyper-accurate values for large file systems, you would need to freeze the file system, perform the operation, and then resume using the file system as normal. I'm not sure how many people would be willing to do this. But the problem is that I've seen people use the ls command as part of an application script. Not the brightest idea in the world in my opinion, but it allowed them to compute what they wanted.
Finally, Ric underscored two other gotchas with extremely large file systems. The first is that remote replication or backup to tape is a very long process. This is because, again, the enumeration and read rates of the file system drop while other file system operations happen concurrently. This increases the length of time needed to perform an already long series of operations.
The second thing Ric highlighted was that long-running operations such as backup could be prone to failures (Ric actually used the words will fail). Consequently, for some of these operations we will need to develop a checkpoint/restart capability and perform only a minimal number of IO retries when hitting a bad sector (currently many file systems will retry several times with an increasing amount of time between retries - this increases the time the file system spends trying to read a bad sector, holding up the whole enumeration/read process).
Summary
While it sounds fairly simple, and conceptually it is, Ric's file system experiments really highlight some limitations in our current file systems. With the increasing pressure of massive amounts of data, having file systems that can scale to billions of files is going to become a requirement (not a nice-to-have).
Ric's simple tests on file systems with 1 million files can easily be reproduced by anyone using small enough files. But these experiments really draw attention to the differences among the file systems, as well as to the sheer amount of time it can take to perform file system tasks. The really good news is that it is definitely possible to function with file systems holding 1 million files if we are a little more patient about the time certain operations take.
Ric's 1 billion file experiment was the really cool final experiment, and we learned a great deal from it. First, we learned that we can actually create and use 1 billion files on our current file systems. They may be slower, but they can function. However, the time it takes to perform certain operations underscored the differences among the file systems. And just as with any experiment that pushes the edge of the proverbial envelope, we learned some new things.
I want to thank Ric for his great presentation and permission to use the images.
Hmm. This guy should give me a call. We were putting 100+ million records on Linux Slackware systems using raw devices in 1995. Have since invented something that will easily hold billions of records that can span multiple machines across any network.
I use ext4 on all of my Linux distros, if possible. It seems to me that SSDs are the next phase of the evolution of storage. Having recently purchased my first 1 TB SATA disk for my new gaming rig, I decided to run Windows 7’s system experience test on my system. I scored 7.6, 7.6, 7.7, and 7.7 respectively on each of the tests (CPU, memory, DirectX, and video), but my hard disk dumped my score down to a 5.9 (MS uses the lowest score as the final). I was shocked, to say the least, but my previous system had a 150 GB SATA disk at 10K RPM rotational speed and netted me a 6.5 on the same test.
Size truly does matter, but interface bandwidth and operating system disk operations appear to be the primary concerns.
I assume you mean a database. Most databases have a relatively small number of files on the filesystem, but the files may include millions or billions of records. That is, record =/= file.
So, are you storing each record in a separate file on the filesystem? If not, you're comparing apples and oranges.
Why wouldn’t you use a database?
http://en.wikipedia.org/wiki/ZFS
I thought that Apple almost put this into 10.5 (maybe), but had some license issues...Not really sure, but it is good to know that folks are in front of this.
It's not necessarily about storing photos or databases.
Why a database? Because in my experience trying to manage a high volume of individual data files is extremely difficult.
By manage I mean keep track of, update, backup, control access to and in general ensure the integrity of the data.
Regardless of whether you’re on an array, Windows with NTFS starts dying with only 20,000 or so files in a single folder. You’re sure to get a lock-up with half a million files.
Yes it can. However, managing thousands of users on a single filesystem with their /home directories in a database is not very feasible, is it?
See Post #5
On the flip side, managing terabyte-sized files is even more of a pain in my experience.
“managing terabyte-sized files is even more of a pain “
Dunno, the largest I’ve dealt with was 23TB. It made restoring from the DR site a bit challenging due to bandwidth limitations but all in all it wasn’t that bad.
You can split a database into multiple files on most systems. It makes dealing with them a little easier, and usually improves performance.
This was an oracle DB hosted on SAN. We used BCVs and all that. The database itself wasn’t a problem. It was moving 23 TB between datacenters that created an issue.
Never underestimate the bandwidth of a station wagon full of backup tapes.
Hm. I tried ext4 but found it unreliable. Then again, it might have been when I migrated to Xubuntu 9.10 which, installed (once) on this system was so flaky I abandoned it for Xubuntu 9.04 and ext3. Stable.
Some flaky hardware? I dunno. Once in a while I have to pull the power plug for a few seconds and re-insert before getting the old box to boot.
Hmmm... how many files do I have on my computer...
$ sudo find / -print | wc -l
605730
That's a lot of files IMO for a simple desktop computer, but nowhere near what they are talking about.
I've dealt with directories at work with 100k+ files in them (as a result of really stupid programmers in this case), and it's not pretty when you need to do cleanup there. Thank God for xargs!
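For anyone facing the same cleanup, the xargs trick the commenter is thanking God for looks roughly like this (a minimal sketch; the directory name and file count are made up). The point is that `rm ./junk/*` expands every name onto one command line and can exceed the kernel's ARG_MAX limit with 100k+ files, while `find | xargs` batches the names into safely sized invocations:

```shell
# Populate a throwaway directory to clean up.
mkdir -p ./junk
i=0
while [ "$i" -lt 100 ]; do touch "./junk/tmp$i"; i=$((i + 1)); done

# -print0/-0 delimit names with NUL bytes, so filenames containing
# spaces or newlines are handled correctly; xargs batches the rm calls.
find ./junk -type f -print0 | xargs -0 rm -f
```

Modern `find` can also do this directly with `-delete`, but the xargs form generalizes to any per-file command.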
File proliferation is actually a serious issue. Desktop users, especially those that have never figured out how to properly organize files can quickly get into a state where they can't find files they are looking for, even if they know it's somewhere on the computer. Your average user doesn't know anything about how to efficiently organize directories and subdirectories. Heck, I sometimes have problems with it myself, and find myself going back through and reorganizing things periodically just to keep things straight.