Posted on 03/26/2015 8:27:11 PM PDT by Utilizer
It's a commonly held belief among software developers that avoiding disk access in favor of doing as much work as possible in-memory will result in shorter runtimes. The growth of big data has made time-saving techniques such as performing operations in-memory more attractive than ever for programmers. New research, though, challenges the notion that in-memory operations are always faster than disk-access approaches and reinforces the need for developers to better understand system-level software.
These findings were recently presented by researchers from the University of Calgary and the University of British Columbia in a paper titled When In-Memory Computing is Slower than Heavy Disk Usage. They tested the assumption that working in-memory is necessarily faster than doing lots of disk writes using a simple example. Specifically, they compared the efficiency of alternative ways to create a 1MB string and write it to disk. An in-memory version concatenated strings of fixed sizes (first 1 byte, then 10, then 1,000, then 1,000,000 bytes) in-memory, then wrote the result to disk in a single write. The disk-only approach wrote the strings directly to disk (e.g., 1,000,000 writes of 1-byte strings, 100,000 writes of 10-byte strings, etc.).
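To make the setup concrete, here is a minimal Python sketch of the two approaches as described (the chunk sizes come from the article; the file names and timing loop are illustrative, not the authors' actual benchmark code):

    import time

    TOTAL = 1_000_000  # build roughly 1 MB of data

    def in_memory(chunk_size):
        # Concatenate fixed-size pieces in memory, then do a single write.
        piece = "a" * chunk_size
        result = ""
        for _ in range(TOTAL // chunk_size):
            result += piece  # repeated concatenation re-copies the growing string
        with open("out_mem.txt", "w") as f:
            f.write(result)

    def disk_only(chunk_size):
        # Write each fixed-size piece straight to disk as it is produced.
        piece = "a" * chunk_size
        with open("out_disk.txt", "w") as f:
            for _ in range(TOTAL // chunk_size):
                f.write(piece)

    for size in (1, 10, 1_000, 1_000_000):
        t0 = time.perf_counter(); in_memory(size)
        t1 = time.perf_counter(); disk_only(size)
        t2 = time.perf_counter()
        print(f"chunk {size:>9}: in-memory {t1-t0:.3f}s, disk-only {t2-t1:.3f}s")

The trap is in the in-memory step, not the write: repeated concatenation of immutable strings in Python or Java copies the growing string over and over, while the small "disk-only" writes are quietly batched in the runtime's and operating system's buffers.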
(Excerpt) Read more at itworld.com ...
Right. In the disk method, the disk driver is performing at least a partial concatenation before actually writing to disk. The disk driver has more efficient code for concatenation than the code generated by Java (no surprise there) or Python.
Yeah, right. Like that's gonna happen. The monkeys churning out code today probably think Big Endian and Little Endian is a children's book about Native Americans.
Amen to that! Just a bunch of monkeys.
I'd go with 10%.
They leave HDDs in their dust:
http://techreport.com/r.x/samsung-850evo/crystal-read.gif
http://techreport.com/r.x/samsung-850evo/crystal-write.gif
http://techreport.com/r.x/samsung-850evo/db2-read.gif
http://techreport.com/r.x/samsung-850evo/db2-write.gif
Doing the operation in memory and then doing a single 1 MB write to disk is still FAR faster than 1 million 1-byte writes, followed by 100,000 10-byte writes, etc.
Even if the in-memory version were built a single byte at a time, that would only match the count of the 1 million 1-byte disk writes, and the disk writes would still be the slower of the two.
Generally, you should set a buffer size of 4-16K and format your app's output directly into the buffer, if possible. You may wish to use multiple 4-16K buffers, so that you are writing into the current buffer while one of your past buffers is being transferred to disk asynchronously. When you fill the current buffer, it should be queued for output, and you should switch your output-formatting activity to scribble on a previous buffer which has already been written. When you are done, you should remember to queue your final buffer for output and wait until all buffers have been written. Then please close the file.
The optimal buffer size and number of buffers should be determined by experiment.
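Something like this rough Python sketch, where a background writer thread stands in for the asynchronous transfer (buffer size and count are the knobs to experiment with; the names are illustrative):

    import threading, queue

    BUF_SIZE = 8 * 1024        # in the suggested 4-16K range
    NUM_BUFS = 2               # tune by experiment, per the note above

    def write_chunks(path, chunks):
        q = queue.Queue(maxsize=NUM_BUFS)   # filled buffers queued for output

        def writer(f):
            while True:
                buf = q.get()
                if buf is None:             # sentinel: all buffers queued
                    return
                f.write(buf)                # the "asynchronous" transfer to disk

        with open(path, "wb") as f:
            t = threading.Thread(target=writer, args=(f,))
            t.start()
            current = bytearray()           # the buffer we format into
            for chunk in chunks:
                current += chunk
                if len(current) >= BUF_SIZE:
                    q.put(bytes(current))   # queue the full buffer for output
                    current = bytearray()   # switch to a fresh buffer
            if current:
                q.put(bytes(current))       # remember the final, partial buffer
            q.put(None)
            t.join()                        # wait until all buffers are written

    # e.g. write_chunks("out.bin", (b"x" * 100 for _ in range(10_000)))

The bounded queue gives you the backpressure for free: when all buffers are in flight, the formatting side blocks until the writer catches up.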
Thanks for the tip, mate. :)
Disk drivers don't know squat about concatenation. They just know about "write this block of memory to this chunk of disk blocks". Of course, what happens next will depend on whether the HD is buffered or whether it's not an HD but an SSD, etc.
Are you talking 1970 or 2015?
If the problem fits in memory, then it can be solved in memory far faster than on disk. If not, then you need a strategy that takes the disparity of access times into account.
E.g., if the problem is sorting the donor file, then you need some sort of algorithm in which sorted subsets are written to disk, then read in and merged, written out again, until you end up with sorted output. Of course, if it's 2015, you just read in the damned file and sort it! Done!
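For the case where it doesn't fit, here's a bare-bones Python sketch of that sort/spill/merge idea (heapq.merge does the k-way merge in a single pass here; max_lines and the temp-file handling are my own illustrative choices):

    import heapq, tempfile

    def external_sort(in_path, out_path, max_lines=1_000_000):
        runs = []                 # one temp file per sorted subset
        with open(in_path) as f:
            while True:
                chunk = [line for _, line in zip(range(max_lines), f)]
                if not chunk:
                    break
                chunk.sort()                      # sort what fits in memory
                run = tempfile.TemporaryFile("w+")
                run.writelines(chunk)             # sorted subset written to disk
                run.seek(0)
                runs.append(run)
        with open(out_path, "w") as out:
            out.writelines(heapq.merge(*runs))    # read back in and merged
        for run in runs:
            run.close()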
In 2015, your laptop or your smartphone likely has way more RAM than a major glassed-in, raised-floor computer installation of the 1970s or 1980s had in RAM plus disk.
Sorry, but my comment was not directed at you personally, but at the author of the original article. My point being that the premise is false except for contrived tests that utilize memory at its most inefficient and optimize for the hard drive.
One million single writes to disk can be a much different proposition if the test has the disk all to itself than it is on a busy system where every write operation can potentially have to get queued and wait for some other process to release the disk channel.
IMHO
“25 years ... half a dozen”
“half at best, 1 out of 10 at worst”
“10%”
Thanks for the replies. It’s nice to know I’m not the only one in this camp. My answer is 10%.
If I were a bit more of an entrepreneur, these markers are the ones I would look for when hiring programmers, because within the 10%, the answer tends to be ... 10%, which tells me there is a block of people with an aptitude that “thinks this way”.
Alas, I work for a living ~
I'm sure it is possible to construct very narrowly tailored circumstances where what they are describing makes sense, but it's such an artificial construct that it's not really useful. It's simply a reminder to never use the word 'never'.
It only proves that you can design a test to do stupid things that don't really apply in the real world.
First and foremost is the fact that memory is everything. In order for a process to write to disk, it must first put that data in a buffer, which is (gasp) MEMORY. In most modern, enterprise-level systems, there is a ton of cache (more memory) sitting in the disk subsystem to receive the data from the operating system prior to it being written to disk.
Let's see them run an application or database doing real-world work and see how their theory holds up. I've got $100 that says "not very well."
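That buffering point is easy to see from user space, for what it's worth. A small sketch (file names illustrative) that writes the same bytes once with Python's default in-memory buffering and once with buffering disabled, so every one-byte write goes straight to the OS:

    import time

    def write_bytes(path, buffering):
        with open(path, "wb", buffering=buffering) as f:
            for _ in range(100_000):
                f.write(b"x")               # one byte at a time

    t0 = time.perf_counter()
    write_bytes("unbuffered.bin", 0)        # buffering=0: every write hits the OS
    t1 = time.perf_counter()
    write_bytes("buffered.bin", -1)         # default: bytes accumulate in memory first
    t2 = time.perf_counter()
    print(f"unbuffered: {t1-t0:.3f}s  buffered: {t2-t1:.3f}s")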
In simple terms, the test was to get a string of bits written to the disk in a given order.
So a one-step operation (write to the disk) is faster than a two-step operation (organize the bits in memory, then write to the disk)....
Gee, that's a shock... (/sarcasm off)
ANYTHING can be done faster in memory than on disk if the problem is properly stated and the program is properly constructed. I can easily imagine situations in which either approach could be the faster one, given (essentially) unlimited space on disk but not in memory (a problem with “sparse matrices”, for example).
When performance means money (server time, server sizing, etc...), it can actually make sense to do these types of tests.