No, It's Not Always Quicker To Do Things In Memory
itwbennett writes: It's a commonly held belief among software developers that avoiding disk access in favor of doing as much work as possible in-memory will results in shorter runtimes. To test this assumption, researchers from the University of Calgary and the University of British Columbia compared the efficiency of alternative ways to create a 1MB string and write it to disk. The results consistently found that doing most of the work in-memory to minimize disk access was significantly slower than just writing out to disk repeatedly (PDF).
This is the dumbest research I've seen in 2015. There was actually no computation involved -- they just wanted to write a long string to disk. They concluded that adding the superfluous step of concatenating strings in memory, then writing to disk, was slower. Well duh! That's not what memory is for!
SSDs and disk speed have nothing to do with this. None of these writes are hitting disk. All they've shown is that when you cache a write to disk, the operating system might add data to it more efficiently than the slow Python and Java string code can expand a string.
You don't even have to read the code. Reading what languages they used reveals the entire flaw. They used languages with expensive string operations when done in-memory which is the only reason why writing to a buffered cache and writing to disk is faster.
Specifically, the time measured to write to memory uses the following code:
for (int i=0; i < numIter; i++) {
concatString += addString;
}
The time measured to write to disk uses the following code:
for (int i=0; i < numIter; i++) {
writer.write(addString);
}
writer.flush();
writer.close();
In Java, strings are immutable. Each string concatenation produces a new string on the heap, and the old string is unchanged. So there are numIter strings created in memory, and I assume garbage collection will probably happen at some point once enough memory is used. O(n) reads and O(n) writes to the heap with O(n^2) memory usage plus an unknown number of garbage collections. This can cause considerable slowing of the in-memory algorithm.
That algorithm is then compared with one that does numIter writes to a buffer, which is then flushed to disk at the end. O(n) writes to memory buffer (no need to re-read memory) using O(n) memory space, followed by O(1) writes to disk and O(n) disk space used.
Granted, it's been over a decade since I took algorithms so I wouldn't doubt that someone can show how I am off, but this kind of thing should be simple to spot for anyone who has an undergrad CS degree.
PS - I love how the paper makes this aside as if it doesn't matter tremendously:
Java performance numbers did not change when the concatenation order was reversed in the code in Appendix 1. However, using a mutable data type such as StringBuilder or StringBuffer dramatically improved the results.
All my liberal friends think I'm a conservative, all my conservative friends think I'm a liberal.
In general writing to RAM is faster than writing to the disk. However there are things that get in the way of both.
1. OS Memory Management: So you making a small memory string to a big one. So will the os fragment the string, when it comes up to an other systems reserved memory spot. Will it overwrite it (Buffer overflow), will it find a contiguous larger memory block and copy the data there. Will it copy and move the memory slots to a new location away from the memory. Will this be happening preemptively, or when the error condition occurs, will all this stuff happen with a cpu cycle that is not sharing with your app. Also if you are low on memory the system may dump it to the disk anyways.
2. OS Disk management: A lot of the same concerns that memory management has. However a bunch of small request is easier to find free space, then asking for a larger spot. So they may be more seek time.
3. Disk Caching: You tell the program to append to the disk. The OS sends the data to the drive, the drive responds back Yea I got it. then the OS goes back to handling your app, in the mean time your drive is actually spinning to save the data on the disk.
4. How your compiler handles the memory. Data = Data + "STRING" vs. Data+="STRING" vs Data.Append("STRING") vs { DataVal2=malloc(6); DataVal2="STRING"; DataRec->Next = *DataVal2; } You could be spending O(n) time saving your memory where you can be doing in in O(1)
Now sometime I do change my algorithm to write to the disk vs. handling it in memory. Mostly because the data I am processing is huge, and I much rather sacrifice speed, in order to insure that the data gets written.
If something is so important that you feel the need to post it on the internet... It probably isn't that important.
[...] For instance, their paper says that concatenating a million one byte strings into a single million byte string takes 274 seconds. That should take much less than one second.
I didn't RTFA, but after reading this I am certainly not going to. This C++ piece of code takes around 0.01 seconds to run on my computer:
#include <iostream>
#include <string>
void build_string(std::string &s, std::string r) {
for (int i = 0; i < 1000000; ++i)
s += r;
}
int main() {
std::string s;
build_string(s, "a");
std::cout s.length() '\n';
}
Changed Java code to use StringBuilder instead of String += String. Results on my machine:
1: 0.010625
10: 0.002375
100: 0.001
Maybe somebody who study Chemical and Biological Science is not good developer
They tested using strings in python and java, both of whose string libraries are very much overweight. And they tested by concatinating strings in a way that requires constant reallocations and memory copies versus pushing data to fixed size disk buffers in the OS cache.
So... surprise! When writing data sequentially the C implementation of disk buffers is faster than the java and python implementations of strings.
Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
It's exactly this. The Java code they wrote uses String, resulting in an O(n^2) algorithm. A trivial change to StringBuilder would result in an improvement to O(n).
The paper is just embarrassing.
There's nothing wrong with Java or Python, but the programmer is inexperienced. Java and Python strings are immutable. So, any time they concatenate a single character to an existing string, the Java runtime creates a brand new string, leaving the original string intact (since it is immutable). So if they create a million character string using using million concatenations, guess what, a million new strings are created and that's very slow. A better solution is to use a mutable String aka, StringBuilder.
But the right solution is to use a small buffer, say 16KB to 100KB in size, fill that with characters and flush that buffer to disk every time it's full. The speed would be same as any other method, but the max memory used is 20x smaller.
It is even worse than that. They are using a BufferedWriter for the so called writing to disk portion. So, they are actually comparing the worst possible way to append a String in memory to appending bytes to a bytes buffer and periodically writing that to disk. So, basically comparing two different in memory string appending techniques? When you bring the OS into play it is even less likely to actually show anything having to do with the disk because the OS will write asynchronously.
My grade: F- and they should be mocked mercilessly until the paper is retracted for being idiotic.
THATS THE ENTIRE POINT OF THIS PAPER.
Your hair look like poop, Bob! - Wanker.