Slashdot Mirror


No, It's Not Always Quicker To Do Things In Memory

itwbennett writes: It's a commonly held belief among software developers that avoiding disk access in favor of doing as much work as possible in-memory will results in shorter runtimes. To test this assumption, researchers from the University of Calgary and the University of British Columbia compared the efficiency of alternative ways to create a 1MB string and write it to disk. The results consistently found that doing most of the work in-memory to minimize disk access was significantly slower than just writing out to disk repeatedly (PDF).

12 of 486 comments (clear)

  1. This is the dumbest research I've seen this year by MobyDisk · · Score: 5, Informative

    This is the dumbest research I've seen in 2015. There was actually no computation involved -- they just wanted to write a long string to disk. They concluded that adding the superfluous step of concatenating strings in memory, then writing to disk, was slower. Well duh! That's not what memory is for!

  2. Re:It depends by greg1104 · · Score: 5, Informative

    SSDs and disk speed have nothing to do with this. None of these writes are hitting disk. All they've shown is that when you cache a write to disk, the operating system might add data to it more efficiently than the slow Python and Java string code can expand a string.

  3. Re:Check their work or check the summary? by Anonymous Coward · · Score: 2, Informative

    You don't even have to read the code. Reading what languages they used reveals the entire flaw. They used languages with expensive string operations when done in-memory which is the only reason why writing to a buffered cache and writing to disk is faster.

  4. Re:Check their work or check the summary? by bondsbw · · Score: 5, Informative

    Specifically, the time measured to write to memory uses the following code:

      for (int i=0; i < numIter; i++) {
              concatString += addString;
      }

    The time measured to write to disk uses the following code:

      for (int i=0; i < numIter; i++) {
              writer.write(addString);
      }
      writer.flush();
      writer.close();

    In Java, strings are immutable. Each string concatenation produces a new string on the heap, and the old string is unchanged. So there are numIter strings created in memory, and I assume garbage collection will probably happen at some point once enough memory is used. O(n) reads and O(n) writes to the heap with O(n^2) memory usage plus an unknown number of garbage collections. This can cause considerable slowing of the in-memory algorithm.

    That algorithm is then compared with one that does numIter writes to a buffer, which is then flushed to disk at the end. O(n) writes to memory buffer (no need to re-read memory) using O(n) memory space, followed by O(1) writes to disk and O(n) disk space used.

    Granted, it's been over a decade since I took algorithms so I wouldn't doubt that someone can show how I am off, but this kind of thing should be simple to spot for anyone who has an undergrad CS degree.

    PS - I love how the paper makes this aside as if it doesn't matter tremendously:

    Java performance numbers did not change when the concatenation order was reversed in the code in Appendix 1. However, using a mutable data type such as StringBuilder or StringBuffer dramatically improved the results.

    --
    All my liberal friends think I'm a conservative, all my conservative friends think I'm a liberal.
  5. Re:It depends by jellomizer · · Score: 3, Informative

    In general writing to RAM is faster than writing to the disk. However there are things that get in the way of both.
    1. OS Memory Management: So you making a small memory string to a big one. So will the os fragment the string, when it comes up to an other systems reserved memory spot. Will it overwrite it (Buffer overflow), will it find a contiguous larger memory block and copy the data there. Will it copy and move the memory slots to a new location away from the memory. Will this be happening preemptively, or when the error condition occurs, will all this stuff happen with a cpu cycle that is not sharing with your app. Also if you are low on memory the system may dump it to the disk anyways.

    2. OS Disk management: A lot of the same concerns that memory management has. However a bunch of small request is easier to find free space, then asking for a larger spot. So they may be more seek time.

    3. Disk Caching: You tell the program to append to the disk. The OS sends the data to the drive, the drive responds back Yea I got it. then the OS goes back to handling your app, in the mean time your drive is actually spinning to save the data on the disk.

    4. How your compiler handles the memory. Data = Data + "STRING" vs. Data+="STRING" vs Data.Append("STRING") vs { DataVal2=malloc(6); DataVal2="STRING"; DataRec->Next = *DataVal2; } You could be spending O(n) time saving your memory where you can be doing in in O(1)

    Now sometime I do change my algorithm to write to the disk vs. handling it in memory. Mostly because the data I am processing is huge, and I much rather sacrifice speed, in order to insure that the data gets written.

    --
    If something is so important that you feel the need to post it on the internet... It probably isn't that important.
  6. Re:It depends by PacoSuarez · · Score: 4, Informative

    [...] For instance, their paper says that concatenating a million one byte strings into a single million byte string takes 274 seconds. That should take much less than one second.

    I didn't RTFA, but after reading this I am certainly not going to. This C++ piece of code takes around 0.01 seconds to run on my computer:

    #include <iostream>
    #include <string>

    void build_string(std::string &s, std::string r) {
        for (int i = 0; i < 1000000; ++i)
            s += r;
    }

    int main() {
        std::string s;
        build_string(s, "a");
        std::cout s.length() '\n';
    }

  7. Re:Check their work or check the summary? by Anonymous Coward · · Score: 3, Informative

    Changed Java code to use StringBuilder instead of String += String. Results on my machine:
    1: 0.010625
    10: 0.002375
    100: 0.001

    Maybe somebody who study Chemical and Biological Science is not good developer

  8. python and java by Spazmania · · Score: 4, Informative

    They tested using strings in python and java, both of whose string libraries are very much overweight. And they tested by concatinating strings in a way that requires constant reallocations and memory copies versus pushing data to fixed size disk buffers in the OS cache.

    So... surprise! When writing data sequentially the C implementation of disk buffers is faster than the java and python implementations of strings.

    --
    Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
  9. Re:It depends by Anonymous Coward · · Score: 2, Informative

    It's exactly this. The Java code they wrote uses String, resulting in an O(n^2) algorithm. A trivial change to StringBuilder would result in an improvement to O(n).

    The paper is just embarrassing.

  10. Re:It depends by gnupun · · Score: 4, Informative

    There's nothing wrong with Java or Python, but the programmer is inexperienced. Java and Python strings are immutable. So, any time they concatenate a single character to an existing string, the Java runtime creates a brand new string, leaving the original string intact (since it is immutable). So if they create a million character string using using million concatenations, guess what, a million new strings are created and that's very slow. A better solution is to use a mutable String aka, StringBuilder.

    But the right solution is to use a small buffer, say 16KB to 100KB in size, fill that with characters and flush that buffer to disk every time it's full. The speed would be same as any other method, but the max memory used is 20x smaller.

  11. Re:Stupid is as stupid publishes.... by Dastardly · · Score: 3, Informative

    It is even worse than that. They are using a BufferedWriter for the so called writing to disk portion. So, they are actually comparing the worst possible way to append a String in memory to appending bytes to a bytes buffer and periodically writing that to disk. So, basically comparing two different in memory string appending techniques? When you bring the OS into play it is even less likely to actually show anything having to do with the disk because the OS will write asynchronously.

    My grade: F- and they should be mocked mercilessly until the paper is retracted for being idiotic.

  12. Re:Check their work or check the summary? by OverlordQ · · Score: 4, Informative

    THATS THE ENTIRE POINT OF THIS PAPER.

    It is easy to explain the results: In high-level languages such as Java and Python, a seemingly benign
    statement such as concatString += addString may actually involve executing many extra cycles behind
    the scenes. To concatenate two strings in a language such as C, if there is not enough space to expand
    the concatString to the size it needs to be to hold the additional bytes from addString, then the
    developer has to explicitly allocate new space with enough storage for the sum of the sizes of the two
    strings and copy concatString to the new location, and then finally perform the concatenation. In Java
    and Python strings are immutable, and any assignment will result in the creation of a new object and
    possibly copy operations, hence the overhead of the string operations. The disk-only code, although
    apparently writing to the disk excessively, is only triggering an actual write when operating system
    buffers are full. In other words, the operating system already lessons disk access times. A developer
    familiar with the language and system internals readily notices the causes of this observed behaviour,
    but this behaviour may be easily missed, as indicated by examining similar cases in production code.

    --
    Your hair look like poop, Bob! - Wanker.