Linus Torvalds Doesn’t Understand User Space Storage
Gluster ^ | 28 June 2011 | Anand Babu

Posted on 06/29/2011 6:16:30 AM PDT by ShadowAce

I was directed to a recent mailing list post by Linus Torvalds on linux-fsdevel in which he derided the concept of user-space filesystems. Not a particular implementation, mind you, but the very concept of it.

Jeff Darcy, of Red Hat and CloudFS fame, wrote a wonderful response, which you should read first before continuing further.

From my perspective, as the creator of GlusterFS, Linus is rather blinkered on this issue. The fact is, the advantages of user space far outweigh those of kernel space. You’ll notice that Linus pointed to no benchmarks or studies confirming his opinion; he merely presented his bias as if it were fact. It is not.

Hypervisors are the modern microkernels. A microkernel is not about size; it is about what should run in kernel mode. Linus’s ideas about filesystems are rather dated. He thinks it is a bad idea to push filesystems to user space while the memory manager stays in kernel mode, because the bulk of the memory buffers are filesystem contents and the two need to work together. This is true for root filesystems with relatively small amounts of data, but not for scalable storage systems. Don’t let the kernel manage that memory for you: in my opinion, kernel space does a poor job of handling large amounts of memory with 4k pages. Look at the bigger picture: disks and memory have grown much larger, and user requirements have grown 1000-fold. To handle today’s scalable, highly available storage needs, filesystems must scale across multiple commodity systems, which is much easier to do in user space. The real bottlenecks are network and disk latencies, buffer copying and chatty IPC/RPC communication. Kernel-to-user context switches are barely visible in that broader picture, so whatever performance improvement kernel mode offers is irrelevant. Better, then, to use the simpler, easier methods offered by user space to satisfy modern storage needs. Operating systems already run in user space in virtualized and cloud environments; kernel developers should overcome this mental barrier.

Once upon a time, Linus eschewed microkernels in favor of a monolithic architecture for the sake of simplicity. One would hope that he could grasp why simplicity wins in this case, too. Unfortunately, he seems to have learned the wrong lesson from the microkernel vs. monolithic kernel debates: the lesson was not that all the important stuff should be thrown into the kernel, but that simplicity outweighs insignificant performance improvements elsewhere. We have seen this in the growth of virtualization and cloud computing, where the tradeoff between new capabilities and performance loss has proved to be irrelevant.

There are bigger issues to address. Simplicity is *the* key to scalability. Features like online self-healing, online upgrades, online node addition/removal, HTTP-based object protocol support, compression/encryption support, HDFS APIs, and certificate-based security are complex in their own right. Requiring that they live in kernel space only adds to the complexity, hampering progress and development. Kernel-mode programming is too complex, too restrictive and unsustainable in many ways: it is hard to find kernel hackers, hard to write and debug code in kernel mode, and hard to handle hardware reliability when you scale out across multiple points of failure.

GlusterFS got its inspiration from the GNU Hurd kernel. Many years ago, GNU Hurd could already mount tarballs as a filesystem, FTP as a filesystem, and POP3 as an mbox file. Users could extend the operating system in clever ways. A FUSE-like user-space architecture was an inherent part of Hurd’s design. Instead of treating filesystems as a module of the operating system, Hurd treated filesystems as the operating system. All parts of the operating system were developed as stackable modules, with Hurd handling hardware abstraction. Didn’t we see the benefits of converging the volume manager, software RAID and filesystem in ZFS? GNU Hurd took it a step further, and GlusterFS brought it to the next level on Linux and other Unix kernels. It treats the Linux kernel as a microkernel that handles hardware abstraction, and it broaches the subject everyone is thinking, if not saying out loud: the cloud is the operating system. In this brave new world, stuffing filesystems into kernel space is counter-productive and hinders development. GlusterFS has inverted the stack, with many traditional kernel-space jobs now handled in user space.
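
To see how low the barrier to a user-space filesystem really is, here is a minimal example written against the libfuse high-level API (FUSE 2.6). It is only a sketch in the spirit of Hurd's translators, not code from Hurd or GlusterFS: it exposes a single read-only file, /hello, and the name "hellofs" and the file contents are invented for the illustration.

#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>

/* hellofs: a toy user-space filesystem exposing one read-only file.
 * Build: gcc hellofs.c -o hellofs $(pkg-config fuse --cflags --libs)
 * Run:   ./hellofs /some/mountpoint */

static const char *hello_str  = "Hello from user space!\n";
static const char *hello_path = "/hello";

static int hello_getattr(const char *path, struct stat *st)
{
    memset(st, 0, sizeof(*st));
    if (strcmp(path, "/") == 0) {
        st->st_mode  = S_IFDIR | 0755;      /* the root directory */
        st->st_nlink = 2;
    } else if (strcmp(path, hello_path) == 0) {
        st->st_mode  = S_IFREG | 0444;      /* our single read-only file */
        st->st_nlink = 1;
        st->st_size  = strlen(hello_str);
    } else {
        return -ENOENT;
    }
    return 0;
}

static int hello_readdir(const char *path, void *buf, fuse_fill_dir_t filler,
                         off_t offset, struct fuse_file_info *fi)
{
    if (strcmp(path, "/") != 0)
        return -ENOENT;
    filler(buf, ".", NULL, 0);
    filler(buf, "..", NULL, 0);
    filler(buf, hello_path + 1, NULL, 0);   /* list "hello" */
    return 0;
}

static int hello_open(const char *path, struct fuse_file_info *fi)
{
    if (strcmp(path, hello_path) != 0)
        return -ENOENT;
    if ((fi->flags & O_ACCMODE) != O_RDONLY)
        return -EACCES;
    return 0;
}

static int hello_read(const char *path, char *buf, size_t size, off_t offset,
                      struct fuse_file_info *fi)
{
    size_t len = strlen(hello_str);
    if (strcmp(path, hello_path) != 0)
        return -ENOENT;
    if ((size_t)offset >= len)
        return 0;
    if (offset + size > len)
        size = len - offset;
    memcpy(buf, hello_str + offset, size);
    return size;
}

static struct fuse_operations hello_oper = {
    .getattr = hello_getattr,
    .readdir = hello_readdir,
    .open    = hello_open,
    .read    = hello_read,
};

int main(int argc, char *argv[])
{
    /* The FUSE kernel module forwards VFS calls to these callbacks. */
    return fuse_main(argc, argv, &hello_oper, NULL);
}

A few dozen lines of ordinary C, debuggable with ordinary tools: that is the "toy making" Linus is dismissing.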

In fact, when you begin to see the cloud and distributed computing as the future (and present), you realize that the entire nomenclature of user space vs. kernel space is anachronistic. In a world where entire operating systems sit inside virtualized containers in user space, what does it even mean to be kernel space any more? Looking at the broader trends, arguing against user space filesystems is like arguing against rising and falling tides. To suggest that nothing significant is accomplished in user space is to ignore all major computing advances of the last decade.

To solve 21st-century distributed computing problems, we needed 21st-century tools for the job, and we wrote them into GlusterFS. GlusterFS manages most operating-system functionality within its own user space: memory management, IO scheduling, volume management, NFS, RDMA and RAID-like distribution. For memory management, it allocates large blocks for large files, resulting in far fewer page-table entries, and garbage collection is easier in user space. Similarly, for IO scheduling, GlusterFS uses elastic hashing across nodes and IO threads within each node. It can scale threads on demand and group blocks belonging to the same inode together, eliminating disk contention. GlusterFS does a better job of managing its own memory and scheduling than the Linux kernel, which has no integrated approach for this workload. It is user-space storage implementations that have scaled the GNU/Linux OS beyond petabytes, seamlessly. That’s not my opinion, it’s a fact: the largest deployments in the world are all user-space. What’s wrong with FUSE simplifying filesystem development to the level of toy-making? :-)
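
As a rough illustration of the elastic-hashing idea, here is a sketch that hashes a file name into a 32-bit space divided evenly among a set of storage nodes, so any client can compute a file's location without asking a metadata server. The hash function (FNV-1a), the even split and the node names are assumptions made for the example; GlusterFS's actual distribution layer is more elaborate (per-directory layout ranges, rebalancing and so on).

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* FNV-1a: a simple, well-known 32-bit string hash, standing in for
 * whatever hash a real elastic-hashing layout would use. */
static uint32_t fnv1a(const char *name)
{
    uint32_t h = 2166136261u;
    for (; *name; name++) {
        h ^= (uint8_t)*name;
        h *= 16777619u;
    }
    return h;
}

/* Map a file name to one of nnodes by splitting the 32-bit hash space
 * into equal ranges -- every client computes placement independently. */
static unsigned pick_node(const char *name, unsigned nnodes)
{
    uint32_t h = fnv1a(name);
    uint64_t range = (uint64_t)UINT32_MAX + 1;
    return (unsigned)(((uint64_t)h * nnodes) / range);
}

int main(void)
{
    /* Hypothetical brick names, purely for the demonstration. */
    const char *nodes[] = { "server1:/brick", "server2:/brick",
                            "server3:/brick", "server4:/brick" };
    const char *files[] = { "report.pdf", "photo001.jpg", "notes.txt" };
    unsigned nnodes = sizeof(nodes) / sizeof(nodes[0]);

    for (unsigned i = 0; i < sizeof(files) / sizeof(files[0]); i++)
        printf("%-14s -> %s\n", files[i], nodes[pick_node(files[i], nnodes)]);
    return 0;
}

The point is not the particular hash; it is that the placement logic lives entirely in user space, where it can be changed, debugged and scaled without touching the kernel.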

Some toys are beautiful and work better than others.


TOPICS: Computers/Internet
KEYWORDS: filesystems; linux
To: Verbosus

Ah - so I think you’re basically saying that Linus may not like it, but Linux (or Linus) cannot block it - so in that sense Linus’s opinion is pretty much just that, an opinion, something we all have, but one that will not determine the ultimate success or failure of the project.

I tend to agree with you - in computing we have all these seemingly bright lines -

1. code vs. data
2. hardware vs. software
3. kernel vs. userland
4. time vs. space

and many, many others.

As you go along, you realize that for a given project or effort, feature A or functionality B can be placed on one side of the line or the other - but it’s all a design choice, not the result of some immutable law of the universe.

As an example, things like event loops move scheduling out of the kernel and into userland. But the scheduling problems don’t go away (or get worse) - they just change form, is all.
Same with many of these other seemingly bright lines.


21 posted on 06/29/2011 9:34:43 AM PDT by 2 Kool 2 Be 4-Gotten (Welcome to the USA - where every day is Backwards Day!)
[ Post Reply | Private Reply | To 19 | View Replies]

To: John O

B.I.N.G.O. and Bingo was his name-oh.


22 posted on 06/29/2011 9:41:53 AM PDT by BuckeyeTexan (There are those that break and bend. I'm the other kind. *4192*)
[ Post Reply | Private Reply | To 13 | View Replies]

To: krb
"Cloud people" think the cloud should include everything. They'd be totally okay if the cloud became SkyNet. ;p

(Obviously, I have no bias against the cloud. /s)

23 posted on 06/29/2011 10:09:22 AM PDT by BuckeyeTexan (There are those that break and bend. I'm the other kind. *4192*)
[ Post Reply | Private Reply | To 14 | View Replies]

To: 2 Kool 2 Be 4-Gotten

That’s true. There really aren’t solid lines anymore. There used to be - back in the day. Now, it’s all about design decisions, which are themselves the result of business & user requirements and programmer preference.


24 posted on 06/29/2011 10:26:00 AM PDT by BuckeyeTexan (There are those that break and bend. I'm the other kind. *4192*)
[ Post Reply | Private Reply | To 21 | View Replies]

To: 2 Kool 2 Be 4-Gotten

Unless you’re working in as close to a “true” micro-kernel environment as possible, there are going to be operations (IO driver calls, scheduling, synchronization, memory copies/DMA shuffles) which the kernel will have to perform on behalf of the user-land FS.

In general, the only places I’ve seen user-land file systems succeed are where absolutely everything possible is stripped out of the kernel - i.e., no half-way measures in making it a micro-kernel. Clocks/timers, interrupts, synchronization & IPC primitives, VM and *possibly* threading are all that is left in the kernel. Even terminal IO has to be pushed out of the kernel (except for actual interrupt handling for an actual serial terminal).

The only commercial micro-kernel I’ve seen actually pull all this off was QNX’s Neutrino. Unlike most research or hobby micro-kernels, Neutrino delivered the goods and delivered them well. Maybe there are other microkernels that are ‘for real’ now, but I’ve been away from the industry on this topic for 10 years.

Microkernels look oh-so-sexy on paper.

Then when you actually start putting together a system based on them, you get to see whether they actually work or not. I’ve seen many micro-kernels utterly crash and burn in evaluation due to a severe lack of IO performance, from not thinking *really* hard about the overhead incurred by pushing the file system out to user space. The folks at QNX thought *really* hard about this and pulled it off - but there is still overhead involved that you wouldn’t have if all the IO stuff were still wrapped into the kernel, with no memory context changes necessary.


25 posted on 06/29/2011 10:39:19 AM PDT by NVDave
[ Post Reply | Private Reply | To 4 | View Replies]

To: BuckeyeTexan

The “cloud people” are nothing more than the “mainframe people,” wearing different work attire. Instead of the blue suit, red tie and white shirt of the IBM/Amdahl/Fujitsu days, we now have beatniks and business casual hipsters of the Google era.

Different acronyms, same “your data is MY data” mentality. Instead of “DASD farm,” we now have “the cloud.” Instead of channel connects, RJE, Token Ring, SNA and so on, we now have the “Internet.”

My reaction: “Big whoop, kid. Tell me something I’ve not heard before. That means you need to do more than just change the nouns and buzzwords you’re using.”

There’s nothing I love more than getting kids all worked up when I listen to some of this stuff by saying: “Yea, OK kid, I’ve heard this before. Different buzzwords. I’m telling you that the generation before me did this already, and I came along to tell them that I was going to do something different with minicomputers and workstations. I’m going to tell you the same thing they told me: Show me something new.”


26 posted on 06/29/2011 10:45:19 AM PDT by NVDave
[ Post Reply | Private Reply | To 23 | View Replies]

To: NVDave

Very nice post. I too have some passing acquaintance with Neutrino. Seems very solid in the application I know about.

Your philosophy sounds virtually identical to that of Linus. Linus had that famous war with Tanenbaum back in the day, and I think he made all of the points you have made - and as we all know, Linus won THAT debate. I think one of Linus’s points is that when you arrive at a true microkernel, one of the things that kills you is the IPC and/or the synchronization between the stuff that you pushed *out* of the kernel, and that this doesn’t scale - so the problem tends to get worse and worse as time goes on, and overall complexity increases.

I suspect that you and Linus are both right - and that Neutrino is somehow a “special” beast.


27 posted on 06/29/2011 11:55:49 AM PDT by 2 Kool 2 Be 4-Gotten (Welcome to the USA - where every day is Backwards Day!)
[ Post Reply | Private Reply | To 25 | View Replies]

To: BuckeyeTexan

Exactly. For example, I just read an article in a trade rag about two chip designs where one chip is an FPGA that the other chip can choose to load with RTL image A, B, or C based on a runtime decision. Don’t know if this works in practice, but it’s an interesting wrinkle.


28 posted on 06/29/2011 12:07:41 PM PDT by 2 Kool 2 Be 4-Gotten (Welcome to the USA - where every day is Backwards Day!)
[ Post Reply | Private Reply | To 24 | View Replies]

To: Texas Fossil; ShadowAce
I am not a big fan of the whole concept of “the cloud”. I simply do not see the point.
[...] I question the motives for “the cloud” and do not think it is in the interest of the user.

I am certainly with you in spirit - although having bookmarks, contacts, appointments, and to-dos in a uniform, easy-to-access place is indeed a very handy thing (for sync between laptop/phone/tablet/desktops, etc.). However, that can just as easily be accomplished with an encrypted tunnel into your own server as with someone else's real estate. And owning your own website is really quite inexpensive - mine is unlimited and costs about $100/yr. No doubt it is less secure than the servers in my basement, but still better than the 'cloud'. If folks understood how easy it is to set up these services, I think the cloud would wither away.

29 posted on 06/29/2011 12:49:26 PM PDT by roamer_1 (Globalism is just socialism in a business suit.)
[ Post Reply | Private Reply | To 6 | View Replies]

To: NVDave

Well, there is one minor difference ... they’re lazy, we weren’t. We did stuff the hard way!

Java does everything for them. They won’t (and probably can’t) clean up after themselves. (*grumbles*)

Reusable objects my butt. They reinvent the wheel every time I turn around, and it’s usually less efficient. Sure, it looks great, but most of the time it’s slower than dog snot and hogs every resource on the server. (*grumbles louder*)


30 posted on 06/29/2011 3:35:36 PM PDT by BuckeyeTexan (There are those that break and bend. I'm the other kind. *4192*)
[ Post Reply | Private Reply | To 26 | View Replies]

To: roamer_1

Thanks for the input. I no longer have a website, but I have had thoughts of putting up my own server.

Is it necessary to have a fixed IP on the server? I suppose it is. My local ISP has that option, but I am currently using DHCP.


31 posted on 06/29/2011 3:40:53 PM PDT by Texas Fossil (Government, even in its best state is but a necessary evil; in its worst state an intolerable one)
[ Post Reply | Private Reply | To 29 | View Replies]

To: 2 Kool 2 Be 4-Gotten

One of the reasons why Linus won that debate was that the state of the MMU on x86 chips back then was... crude by today’s standards.

Where I used Neutrino was on MIPS R4K or similar architectures, which had an MMU that was light years ahead of what was available on x86 even in the late 90’s.

If you’re working on a mapped memory system and you don’t have a really good MMU which can do *fast* mappings, the micro-kernel idea falls down fast. If you’re working on a mapped memory system and you don’t have enough TLBs... the micro-kernel idea falls down fast too. If you’re going to do anything other than trivial IO in the system, you really have to solve the problem of how you get data from one user process’s memory space to another, FAST. Too many system designers are used to copying up out of kernel-mapped pages into user space. You have to start thinking in terms of simply re-mapping pages, and being clever about how you do it.

You have to start being really clever about how to handle the sync/scheduling issues too, which ultimately gets down to the IPC mechanism. If you don’t have a very good IPC design that really addresses what people want to do with it... the whole thing implodes pretty quickly. I saw some commercial micro-kernels from Europe (France, actually) that looked oh-so-groovy on paper... and then when you went to use them, you found out that their “fast” IPC mechanism, while it looked slick on paper, had almost no actual uses in our application. Most everything in the real world needed to use their “slow” IPC mechanism, and the implementation just folded under the IO load in seconds.

Back then, an embedded system with 4MB was a BFD. Today, feh - 4GB isn’t unheard of in embedded systems. So today, with 4KB or 8KB pages, re-mapping a whole page for one byte of actual transfer... who cares? Why copy? Get the re-mapping as fast as possible and suddenly the IPC mechanism starts to scale. Back then, people weren’t willing to re-map pages for a couple of bytes - or even a TCP ACK packet.
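
To make the "re-map instead of copy" point concrete, here is a minimal POSIX sketch in which two processes exchange data through the same mapped page rather than copying buffers through the kernel. It is plain shm_open/mmap between a parent and a forked child - not QNX's actual IPC path - and the segment name and message are made up for the example.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Two processes share one page of memory: the data is never copied
 * through the kernel, both sides simply map the same physical page.
 * The segment name "/demo_ipc" is illustrative only. Link with -lrt
 * on older glibc versions. */
int main(void)
{
    const char *name = "/demo_ipc";
    const size_t size = 4096;          /* one page */

    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    if (ftruncate(fd, size) < 0) { perror("ftruncate"); return 1; }

    char *buf = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    if (fork() == 0) {                 /* child: producer */
        snprintf(buf, size, "hello from the child, no copy needed");
        return 0;
    }

    wait(NULL);                        /* parent: consumer */
    printf("parent read: %s\n", buf);  /* reads the very same pages */

    munmap(buf, size);
    shm_unlink(name);
    return 0;
}

A real microkernel IPC path adds synchronization and scheduling on top of this, which is exactly where the hard part lives.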

So while Linus won that debate with Tanenbaum in, what, the early 90’s, the chips are different today. Today we have commercially deployed systems which are based on real micro-kernels. One of the things that can be done in system architecture to improve this would be to push more and more functionality off the main CPU and into attached task-specific processors - GPUs, IO processors, etc. This is how IBM viewed the kernel in the early 70’s, fer cryin’ out loud (insert another round of harrumphing on my part here... VM/370 could be viewed as a forerunner of the micro-kernel idea if we want...).

Instead, we had these micro-computer kiddies too (*&(*& cheap to up and use anything other than the CPU for everything. Witness the early Macintosh systems - they used the 68K, unmapped, for everything: video, sound, disk IO, serial IO, you name it. Well, that was OK when systems were small. The system programming on those early Mac OS’s was crude, to say the least. When I was hacking Macs back then, there was nothing akin to a semaphore or event flag. You had to do everything in loops - you’d do as much IO or whatever as you could, then stick your driver or task on an event/IO/interrupt/VBL loop queue, and wait to be called again. It was absurd.

I think that packet IO off the ‘net should be handled by a CPU on the Ethernet card, for example. Why should I have to worry about maintaining a TCP connection in the kernel? Tell the card “Hey, doofus... I want to set up a pipe between you and this here IP address... step and fetch and tell me when you’re done!” And the card should handle all the retries, windowing, buffering, etc. I should just have an in-order stream of data and a series of IPCs coming at me to tell me that my data is done, whichever way it was going. Same deal for screen data, same deal for sound, disks, etc.

Today, OS X is... sorta a micro-kernel and it works OK. WinNT *could* have been taken down the road of being a micro-kernel, if Microsoft didn’t have cranial/rectal compaction. Unix systems can go there, certainly. But it gets really messy when you go halfway... and that’s what I’m getting at. What we see is a whole lot of systems that don’t “get” the micro-kernel “gestalt” if you will, and they keep failing to solve their problems outside priv’ed memory space or kernel access. It takes a bunch of work and thinking to break with the monolithic kernel ideas of the last 30+ years and say “No, we’re not putting this into kernel space... let’s find a way to do it in user space.”

It really is hard, lemme tell you. The temptation to just throw up your hands, look at the schedules and slap it into the kernel is terrific.


32 posted on 06/29/2011 4:25:16 PM PDT by NVDave
[ Post Reply | Private Reply | To 27 | View Replies]

To: Texas Fossil
Is it necessary to have a fixed IP on the server?

No--just use Dynamic DNS, No-IP, or another service like those.

33 posted on 06/30/2011 4:46:32 AM PDT by ShadowAce (Linux -- The Ultimate Windows Service Pack)
[ Post Reply | Private Reply | To 31 | View Replies]

To: longtermmemmory

Good cloud analysis. I have a darker opinion of it. Cloud computing is the opposite tactic of a content portal, but with the same goal - to bind the customer to your resources, creating a de facto monopoly of service. It’s stingy too, since subscribers are paying for access to host-based services. The end result will likely be that their access outside the subscription DMZ will be restricted or additional fees applied - for security purposes, of course.

Cloud computing is best left to mobile devices - for now. In a few years those devices will have the power and storage to avoid the cloud - if that’s still possible at the time.


34 posted on 06/30/2011 5:36:04 AM PDT by Justa (Obama, the Tee Party candidate.)
[ Post Reply | Private Reply | To 10 | View Replies]

To: NVDave

Great post, Dave. I co-oped at IBM for a summer and then some. Worked on VM on the 370 - or I think it might have been MVS? Something like that; can’t remember exactly. Had to have JCL to run jobs - the whole thing. It was a great experience - frustrating at first, but one soon got the hang of it. Sounds like there is a lot of back-to-the-future going on. Ah, but such is life.

I remember in my computer graphics class at grad school they talked about the cycle of reinvention, or some such: how graphics processing cycles between being done all on the main CPU (to minimize bus traffic) and offloaded to a GPU (computationally more efficient), and how this is as predictable as the sun rising and setting.

It sounds like we’re sort of saying the same thing here, but generalized to other devices like MMUs, I/O devices and such, with the further twist that it ripples through into OS architecture. I know people talk about IBM’s channel processor reverentially - that the beauty of big iron was the ability to do industrial amounts of I/O, and do it independent of the CPU.

I think the points that I can take away from this are:

1. Hardware drives software and has since the beginning of time.
2. OS’s can and do evolve in response to #1 but it takes time because you don’t revamp an OS overnight.
3. Just because something looks good on paper doesn’t mean it will fly.

Should be interesting to see where this is all headed. And where multicore fits into all of this.


35 posted on 06/30/2011 5:39:47 AM PDT by 2 Kool 2 Be 4-Gotten (Welcome to the USA - where every day is Backwards Day!)
[ Post Reply | Private Reply | To 32 | View Replies]

To: Texas Fossil
"I am not a big fan of the whole concept of “the cloud”."

Well, I started using the Amazon Cloud Player for music. I can see the advantage of such, but I am still leery of putting my music library out in cyberspace without it being backed up on multiple hard drives and CD/DVD hard copies.

36 posted on 06/30/2011 5:42:50 AM PDT by Mad Dawgg (If you're going to deny my 1st Amendment rights then I must proceed to the 2nd one...)
[ Post Reply | Private Reply | To 6 | View Replies]

To: ShadowAce

Thanks, I learned something today.


37 posted on 06/30/2011 5:43:55 AM PDT by Texas Fossil (Government, even in its best state is but a necessary evil; in its worst state an intolerable one)
[ Post Reply | Private Reply | To 33 | View Replies]

To: Mad Dawgg
I could see how it would be advantageous to place large files that are infrequently used on a “cloud” server, but I no longer handle huge data files.

For 5-1/2 years I composed a distributor catalog for a wholesaler. My data storage was on the company's AS400, but I maintained my images on a desktop computer. We set up a backup system for the images so my assistant could access and sync against that storage area. That worked pretty well. It was not a fast network, but it was a very very stable platform.

A lot of businesses still use AS400’s because of the stability.

That was a 100+ year old company that made the fatal mistake of buying a similar company in CA about 14 years ago. That mistake, and the owner buying in on the “green” illusion, caused the TX company to close. The CA company never showed a profit from the first day it was bought to the last. The TX company never had a year in which it failed to make a profit. The CA company was incorporated separately, but they shared some common officers. Long story short, the TX company was/is being liquidated because of the hemorrhage from the CA company.

The TX company made the owner $1 million net the year before they began liquidating. But the CA company lost more than that. When sales dropped 10% due to the economy, the cash flow dropped with it. They got behind with their suppliers and then began liquidating.

I had worked for the TX company in 2 stints totaling just short of 20 years. Worked for a similar company near Kansas City for 14 years between the stints with the TX company.

38 posted on 06/30/2011 6:03:18 AM PDT by Texas Fossil (Government, even in its best state is but a necessary evil; in its worst state an intolerable one)
[ Post Reply | Private Reply | To 36 | View Replies]

To: Justa

when it comes to cloud computing.

Ask. For example, if your lawyer uses cloud computing, find another lawyer. Same with your doctor, etc.

Cloud computing should be toxic to businesses.


39 posted on 06/30/2011 10:43:51 AM PDT by longtermmemmory (VOTE! http://www.senate.gov and http://www.house.gov)
[ Post Reply | Private Reply | To 34 | View Replies]

To: Texas Fossil
Sorry, I missed your post -

I no longer have a website, but have had thoughts of putting up my own server. Is it necessary to have a fixed IP on the server? I suppose it is. My local ISP has that option, but I am currently using DHCP.

Actually, I am in the middle of that problem right now - I used to have DSL with a fixed IP, and everything was swell. Redirected a subdomain from my website to it, and obtained instant gratification. Local HTTP, FTP, TightVNC, LDAP and iCal servers, and everything was sweet... all pumped through my web-facing server (Ubuntu), with the tasty parts inside a VPN tunnel.

BUT, I recently changed to cable for the awesome speed. Their 'fixed' IP is not so fixed, and I have big problems hooking up to it... So I went with temporarily redirecting my subdomain to a private folder on my website for now, and shipped it all up there... until I can figger out how to get their dynamically assigned 'fixed IPs' to work... At least I am getting to know the guys in their tech dept pretty well : ( ...

There are some hokey DNS providers that basically run a TSR on your box that transmits its IP address periodically back to the provider, thereby getting around DHCP, but from what I know, a fixed IP is the easy way.
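
For what it's worth, those little update clients are simple enough to sketch: the loop below just phones home to an update URL every few minutes so the provider can record your current address. The URL, hostname and credentials are hypothetical placeholders, not any real provider's API; it leans on libcurl for the HTTP request.

#include <curl/curl.h>
#include <stdio.h>
#include <unistd.h>

/* Rough sketch of a dynamic-DNS update client: periodically hit the
 * provider's update URL so the hostname keeps pointing at whatever
 * address DHCP handed us. The URL and credentials below are made-up
 * placeholders. Build with: gcc ddns.c -o ddns -lcurl */
int main(void)
{
    const char *update_url =
        "https://dyndns.example.com/update?hostname=myhouse.example.com";

    curl_global_init(CURL_GLOBAL_DEFAULT);
    for (;;) {
        CURL *curl = curl_easy_init();
        if (curl) {
            curl_easy_setopt(curl, CURLOPT_URL, update_url);
            curl_easy_setopt(curl, CURLOPT_USERPWD, "user:secret");
            /* Most providers record the client's source address, so
             * there is no need to look up and send our own IP. */
            if (curl_easy_perform(curl) != CURLE_OK)
                fprintf(stderr, "update failed, will retry\n");
            curl_easy_cleanup(curl);
        }
        sleep(300);   /* check in every five minutes */
    }
    return 0;
}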

I will let you know if I get this resolved with cable, because the method will seemingly and needfully require something different, but a true fixed IP is the best...

40 posted on 06/30/2011 1:00:17 PM PDT by roamer_1 (Globalism is just socialism in a business suit.)
[ Post Reply | Private Reply | To 31 | View Replies]


