No, It's Not Always Quicker To Do Things In Memory
itwbennett writes: It's a commonly held belief among software developers that avoiding disk access in favor of doing as much work as possible in-memory will results in shorter runtimes. To test this assumption, researchers from the University of Calgary and the University of British Columbia compared the efficiency of alternative ways to create a 1MB string and write it to disk. The results consistently found that doing most of the work in-memory to minimize disk access was significantly slower than just writing out to disk repeatedly (PDF).
'll have to dig through their testing and methods, but this seems pretty fishy given the summary.
Seek/Read/Write time of a disk is always slower than memory. No exceptions to the rule exist given current commodity hardware. Bus length to a disk is also much longer than to memory. Again, there are no exceptions given commodity hardware.
Won't be the first time someone reported that the laws of physics don't exist for something, and I'm sure it won't be the last. Maybe someone with free mornings in the US can break it down better than the summary.
-The wise argue that there are few absolutes, the fool argues that there are no probabilities.
Even the slowest DDR3 SDRAM has more memory bandwidth and magnitudes faster access time.
RAM *is* faster (by far) than any persistent media 9SSD, HD...). So whatever the test, the algorithm is probably bad,
Slashdot, fix the reply notifications... You won't get away with it...
A more accurate title would be: "You can be sufficiently stupid with your memory access that it's faster to do disk IO."
Java is not the only system that can manifest this.
A Pirate and a Puritan look the same on a balance sheet.
Even the slowest DDR3 SDRAM has more memory bandwidth and magnitudes faster access time.
Indeed. Their results make no sense. They are doing something weird. For instance, their paper says that concatenating a million one byte strings into a single million byte string takes 274 seconds. That should take much less than one second. Their code is listed at the end of the paper, and they seem to be assuming that "flush" means the code is actually written to disk. It does not. It just means the bytes were passed to the operating system.
The real story here, is that if you don't know how to write code properly, then string concatenation can be really slow.
Was their paper peer reviewed?
Sorry but you'll need to do it without using any memory. We need to make it fast.
Memory bandwidth is about 20Gb/s. Disk bandwidth is about 0.05Gb/s. The performance consequences of this are obvious to anyone who knows how basic arithmetic works.
The results they got are invalid because their test framework is broken. This is exactly why everyone should be forced to learn C/C++ or Assembler in college/university. The reason for the crap result is they did not preallocate their buffers so they wasted all their execution time allocating and reallocating larger buffers from the heap. The disk APIs have their own internal buffer implementations, that were not written by idiots, that manage this correctly which is the cause of the difference.
I just scanned the paper, because their claim seem to be idiotic. It looks like they are appending a single byte on the end of a string in memory and on disk. For the memory operation, this will result in a string copy since strings are immutable, vs. doing a one byte file append onto the disk. The former is increasingly expensive and the latter is a fixed cost, so after infinite operations, the disk cost becomes far less than the memory operation. If this is indeed their claim, and I am not missing something, then they should be collectively slapped for wasting our time by writing this paper. If this is really your use case, write some proper data structures to manage your data in a sane fashion.
So yes, if you do stupid things, you can make bad engineering decisions look like good ones.
HA! I just wasted some of your bandwidth with a frivolous sig!
It makes perfect sense once you read the paper. The conclusion is techniocally correct but deceptive.
The results apply in the case of Java and Python where strings are immutable objects. They also used buffered I/O handled by libc. When you concatenate immutable strings, you must allocate a new string large enough to hold both parts, then a memcpy from both of the parts is performed to construct it. The parts are eventually garbage collected.
In contrast, writing to a file with buffered I/O means just copying the additional write buffer to the current end of the buffer and moving updating the accounting information.
As a result, in both cases, only one actual filesystem transaction takes place writing out the complete string. Thus, the actual practical difference between the two methods is that the 'in memory' version copies the memory around many times while the 'disk i/o' one copies the data once (in multiple steps, but each byte sees one copy).
That seems like a bit of a no-brainer, but the point is valid because many programmers may deceive themselves into thinking the 'in memory' method is faster because they don't take the file i/o buffering and the way immutable strings are handled into account.
How in the world? Trivially. They're doing it in an O(n^2) way - it's the only explanation.
If you use string concat library code naively, you can end up "copy the string, add one byte, repeat" easily enough in languages like Java. And it's not exactly breakthrough research to discover that O(n) disk can be faster than O(n^2) memory for large enough n.
Socialism: a lie told by totalitarians and believed by fools.
And they're using BufferedWriter to write to the file which, as the name suggests, is buffering the data *in memory* before writing it.
So the result of the paper is actually O(n) in memory algorithm outperforms O(n^2) in memory algorithm for data sizes of 1MB. Hardly surprising.