No, It's Not Always Quicker To Do Things In Memory

← Back to Stories (view on slashdot.org)

No, It's Not Always Quicker To Do Things In Memory

Posted by Soulskill on Wednesday March 25, 2015 @03:45AM from the performance-that-fails-to-perform dept.

itwbennett writes: It's a commonly held belief among software developers that avoiding disk access in favor of doing as much work as possible in-memory will results in shorter runtimes. To test this assumption, researchers from the University of Calgary and the University of British Columbia compared the efficiency of alternative ways to create a 1MB string and write it to disk. The results consistently found that doing most of the work in-memory to minimize disk access was significantly slower than just writing out to disk repeatedly (PDF).

7 of 486 comments (clear)

Min score:

Reason:

Sort:

This is the dumbest research I've seen this year by MobyDisk · 2015-03-25 03:53 · Score: 5, Informative

This is the dumbest research I've seen in 2015. There was actually no computation involved -- they just wanted to write a long string to disk. They concluded that adding the superfluous step of concatenating strings in memory, then writing to disk, was slower. Well duh! That's not what memory is for!
Re:It depends by greg1104 · 2015-03-25 04:09 · Score: 5, Informative

SSDs and disk speed have nothing to do with this. None of these writes are hitting disk. All they've shown is that when you cache a write to disk, the operating system might add data to it more efficiently than the slow Python and Java string code can expand a string.
Re:Check their work or check the summary? by bondsbw · 2015-03-25 04:23 · Score: 5, Informative

Specifically, the time measured to write to memory uses the following code:
for (int i=0; i < numIter; i++) { concatString += addString; }
The time measured to write to disk uses the following code:
for (int i=0; i < numIter; i++) { writer.write(addString); } writer.flush(); writer.close();
In Java, strings are immutable. Each string concatenation produces a new string on the heap, and the old string is unchanged. So there are numIter strings created in memory, and I assume garbage collection will probably happen at some point once enough memory is used. O(n) reads and O(n) writes to the heap with O(n^2) memory usage plus an unknown number of garbage collections. This can cause considerable slowing of the in-memory algorithm.
That algorithm is then compared with one that does numIter writes to a buffer, which is then flushed to disk at the end. O(n) writes to memory buffer (no need to re-read memory) using O(n) memory space, followed by O(1) writes to disk and O(n) disk space used.
Granted, it's been over a decade since I took algorithms so I wouldn't doubt that someone can show how I am off, but this kind of thing should be simple to spot for anyone who has an undergrad CS degree.
PS - I love how the paper makes this aside as if it doesn't matter tremendously:

Java performance numbers did not change when the concatenation order was reversed in the code in Appendix 1. However, using a mutable data type such as StringBuilder or StringBuffer dramatically improved the results.

--
All my liberal friends think I'm a conservative, all my conservative friends think I'm a liberal.
Re:It depends by PacoSuarez · 2015-03-25 04:55 · Score: 4, Informative

[...] For instance, their paper says that concatenating a million one byte strings into a single million byte string takes 274 seconds. That should take much less than one second.
I didn't RTFA, but after reading this I am certainly not going to. This C++ piece of code takes around 0.01 seconds to run on my computer:
#include <iostream>
#include <string>
void build_string(std::string &s, std::string r) {
for (int i = 0; i < 1000000; ++i)
s += r;
}
int main() {
std::string s;
build_string(s, "a");
std::cout s.length() '\n';
}
python and java by Spazmania · 2015-03-25 05:10 · Score: 4, Informative

They tested using strings in python and java, both of whose string libraries are very much overweight. And they tested by concatinating strings in a way that requires constant reallocations and memory copies versus pushing data to fixed size disk buffers in the OS cache.
So... surprise! When writing data sequentially the C implementation of disk buffers is faster than the java and python implementations of strings.

--
Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
Re:It depends by gnupun · 2015-03-25 06:50 · Score: 4, Informative

There's nothing wrong with Java or Python, but the programmer is inexperienced. Java and Python strings are immutable. So, any time they concatenate a single character to an existing string, the Java runtime creates a brand new string, leaving the original string intact (since it is immutable). So if they create a million character string using using million concatenations, guess what, a million new strings are created and that's very slow. A better solution is to use a mutable String aka, StringBuilder.
But the right solution is to use a small buffer, say 16KB to 100KB in size, fill that with characters and flush that buffer to disk every time it's full. The speed would be same as any other method, but the max memory used is 20x smaller.
Re:Check their work or check the summary? by OverlordQ · 2015-03-25 07:50 · Score: 4, Informative

THATS THE ENTIRE POINT OF THIS PAPER.

It is easy to explain the results: In high-level languages such as Java and Python, a seemingly benign
statement such as concatString += addString may actually involve executing many extra cycles behind
the scenes. To concatenate two strings in a language such as C, if there is not enough space to expand
the concatString to the size it needs to be to hold the additional bytes from addString, then the
developer has to explicitly allocate new space with enough storage for the sum of the sizes of the two
strings and copy concatString to the new location, and then finally perform the concatenation. In Java
and Python strings are immutable, and any assignment will result in the creation of a new object and
possibly copy operations, hence the overhead of the string operations. The disk-only code, although
apparently writing to the disk excessively, is only triggering an actual write when operating system
buffers are full. In other words, the operating system already lessons disk access times. A developer
familiar with the language and system internals readily notices the causes of this observed behaviour,
but this behaviour may be easily missed, as indicated by examining similar cases in production code.

--
Your hair look like poop, Bob! - Wanker.