No, It's Not Always Quicker To Do Things In Memory

It depends by rossdee · 2015-03-25 03:48 · Score: 1

on the speed of your memory, and the speed of your disk, SSD's are getting more common.

Re:It depends by Lunix+Nutcase · 2015-03-25 04:04 · Score: 4, Insightful

Even the slowest DDR3 SDRAM has more memory bandwidth and orders of magnitude faster access time.
Re:It depends by greg1104 · 2015-03-25 04:09 · Score: 5, Informative

SSDs and disk speed have nothing to do with this. None of these writes are hitting disk. All they've shown is that when you cache a write to disk, the operating system might add data to it more efficiently than the slow Python and Java string code can expand a string.
Re:It depends by hcs_$reboot · 2015-03-25 04:11 · Score: 4, Insightful

RAM *is* faster (by far) than any persistent media (SSD, HD...). So whatever the test, the algorithm is probably bad.

--
Slashdot, fix the reply notifications... You won't get away with it...
Re:It depends by Carewolf · 2015-03-25 04:24 · Score: 3, Insightful

on the speed of your memory, and the speed of your disk, SSD's are getting more common.
No, it doesn't. Memory is faster. If they get a result saying otherwise, they are doing it wrong, and are actually just measuring the performance of the in-memory cache speeding up the simplest implementation vs the performance of their own crappy implementation.
Re:It depends by jedidiah · 2015-03-25 04:25 · Score: 5, Insightful

A more accurate title would be: "You can be sufficiently stupid with your memory access that it's faster to do disk IO."
Java is not the only system that can manifest this.

--
A Pirate and a Puritan look the same on a balance sheet.
Re:It depends by aethelrick · 2015-03-25 04:29 · Score: 1

hehe... well said. If I had mod points you'd be getting them for making me laugh.
Re:It depends by ShanghaiBill · 2015-03-25 04:32 · Score: 5, Insightful

Even the slowest DDR3 SDRAM has more memory bandwidth and orders of magnitude faster access time.
Indeed. Their results make no sense. They are doing something weird. For instance, their paper says that concatenating a million one byte strings into a single million byte string takes 274 seconds. That should take much less than one second. Their code is listed at the end of the paper, and they seem to be assuming that "flush" means the code is actually written to disk. It does not. It just means the bytes were passed to the operating system.
The real story here, is that if you don't know how to write code properly, then string concatenation can be really slow.
Was their paper peer reviewed?
Re:It depends by Lunix+Nutcase · 2015-03-25 04:33 · Score: 1

It's on arXiv so probably not yet. Hopefully it gets sufficiently mocked.
Re:It depends by jellomizer · 2015-03-25 04:48 · Score: 3, Informative

In general writing to RAM is faster than writing to the disk. However there are things that get in the way of both.
1. OS Memory Management: Say you are growing a small in-memory string into a big one. Will the OS fragment the string when it runs up against another system's reserved memory spot? Will it overwrite it (buffer overflow)? Will it find a larger contiguous memory block and copy the data there? Will it copy and move the memory slots to a new location, out of the way? Will this happen preemptively, or only when the error condition occurs? Will all this happen on a CPU cycle that is not shared with your app? Also, if you are low on memory, the system may dump it to disk anyway.
2. OS Disk Management: A lot of the same concerns as memory management. However, for a bunch of small requests it is easier to find free space than for one larger spot, so there may be more seek time.
3. Disk Caching: You tell the program to append to the disk. The OS sends the data to the drive, and the drive responds, "Yeah, I got it." Then the OS goes back to handling your app, while in the meantime your drive is actually spinning to save the data on the disk.
4. How your compiler handles the memory: Data = Data + "STRING" vs. Data += "STRING" vs. Data.Append("STRING") vs. { DataVal2 = malloc(7); strcpy(DataVal2, "STRING"); DataRec->Next = DataVal2; } You could be spending O(n) time on each append where you could be doing it in O(1).
Now, sometimes I do change my algorithm to write to disk rather than handle it in memory, mostly because the data I am processing is huge and I would much rather sacrifice speed to ensure that the data gets written.

--
If something is so important that you feel the need to post it on the internet... It probably isn't that important.
Re:It depends by Anonymous Coward · 2015-03-25 04:54 · Score: 5, Funny

Was their paper peer reviewed?
It just was. Why do you ask?
lololol
Re:It depends by PacoSuarez · 2015-03-25 04:55 · Score: 4, Informative

[...] For instance, their paper says that concatenating a million one byte strings into a single million byte string takes 274 seconds. That should take much less than one second.
I didn't RTFA, but after reading this I am certainly not going to. This C++ piece of code takes around 0.01 seconds to run on my computer:
#include <iostream>
#include <string>

// Append r to s one million times; std::string grows its buffer in place.
void build_string(std::string &s, const std::string &r) {
    for (int i = 0; i < 1000000; ++i)
        s += r;
}

int main() {
    std::string s;
    build_string(s, "a");
    std::cout << s.length() << '\n';
}
Re:It depends by jeffmeden · 2015-03-25 05:00 · Score: 1

RAM *is* faster (by far) than any persistent media (SSD, HD...). So whatever the test, the algorithm is probably bad.
I read this summary as "when the goal is to write a string to disk, building it in memory first is slower than just writing it to the damn disk in the first place".
Followed by a "does this mean my cafeteria meal card is going to get renewed?" at the end.
Re:It depends by Lunix+Nutcase · 2015-03-25 05:10 · Score: 3, Insightful

Except they don't write to disk. They wrote to an OS controlled buffer. Simply calling flush does not force a disk write. It signals the OS to take control of the buffer.
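
To make the flush point concrete, here is a minimal Python sketch (not the paper's code). The helper names write_and_sync and demo are made up for illustration. flush() only pushes Python's own buffer down to the OS; os.fsync() is the call that asks the OS to commit the file to the storage device, and even then the drive's own cache may get a say.

```python
import os
import tempfile


def write_and_sync(path, data, force_to_disk):
    """Write data to path; optionally ask the OS to push it to the device."""
    with open(path, "wb") as f:
        f.write(data)              # lands in Python's user-space buffer
        f.flush()                  # hands the bytes to the OS, not to the platter
        if force_to_disk:
            os.fsync(f.fileno())   # asks the OS to commit the file to storage
    return os.path.getsize(path)


def demo():
    # Hypothetical usage: write 1000 bytes to a temp file and report its size.
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "out.bin")
        return write_and_sync(path, b"a" * 1000, force_to_disk=True)
```

A benchmark that stops at flush() is timing a memory copy into the OS cache, which is the objection being made here.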
Re:It depends by TWX · 2015-03-25 05:16 · Score: 1

This is just a guess as I'm not a programmer but am acquainted with computer architecture...

If they're writing the string to disk and not really reading it back constantly then the act of writing could be being handed to the disk controller in chunks and effectively offloaded, which would reduce CPU time used for those sequential writes compared to the CPU handling all of the work in memory as the process goes then handing it to the disk controller only once to write it.

I don't think this would hold up if they're having to read from the disk versus read from RAM or cache, which is how I expect most real-world applications to work. Their comparison reminds me of putting an old van with an inline 6 cylinder engine on a dyno roller with almost no resistance and spinning the wheels up to 140mph, then claiming that the van is as fast as a muscle car.

--
Do not look into laser with remaining eye.
Re:It depends by Bengie · 2015-03-25 05:23 · Score: 1

Pretty much. What they're saying is it's faster to write data straight to disk than to copy that data around in memory, then write it to disk. In other words, A is less than A + B
Re:It depends by Rakarra · 2015-03-25 05:27 · Score: 2

Probably not, and sadly this is the problem with current CS tracks in colleges.
Zero education on hardware gives us CS grads that are inept.
It's not necessarily that -- I think there's a third layer that doesn't get the attention deserved while people work on end-user applications or tinker building hardware. There's not much attention on operating system design and fundamentals. Your code will usually be dealing with an OS and rarely with the bare hardware, so I'm surprised there's as little attention about operating systems principles and design.
Re:It depends by Lunix+Nutcase · 2015-03-25 05:32 · Score: 1

Except they don't write data straight to disk. They write to an OS-controlled buffer. Their code only forces the buffer to be passed to the OS for it to do with as it pleases.
Re:It depends by imnes · 2015-03-25 05:33 · Score: 1

It looks like they're creating 1,000,000 strings (from 1 byte, up to 1,000,000 bytes). Isn't that ~ 500GB of memory allocations and releases?
Re:It depends by Penguinisto · 2015-03-25 05:38 · Score: 4, Interesting

That's the very first thing I thought of... what if the code were written in a lower-level language (and not in fucking python or Java!), then made do this task on Windows $latest, OSX $latest, Linux $latest, maybe a resurrected DOS $latest for reference, etc... I mean, it can't be that hard to write this thing in C and port it as needed.
Doesn't seem very scientific at all otherwise. I mean, are they testing memory versus disk, are they testing memory vs. disk performance in a given specific language, or what? Maybe they just needed to flesh out their abstract a bit more to reflect this?

--
Quo usque tandem abutere, Nimbus, patientia nostra?
Re:It depends by Penguinisto · 2015-03-25 05:39 · Score: 1, Troll

They used Java and Python. Draw your own conclusions from that. ;)

--
Quo usque tandem abutere, Nimbus, patientia nostra?
Re:It depends by sjames · 2015-03-25 05:41 · Score: 4, Insightful

It makes perfect sense once you read the paper. The conclusion is technically correct but deceptive.
The results apply in the case of Java and Python where strings are immutable objects. They also used buffered I/O handled by libc. When you concatenate immutable strings, you must allocate a new string large enough to hold both parts, then a memcpy from both of the parts is performed to construct it. The parts are eventually garbage collected.
In contrast, writing to a file with buffered I/O means just copying the additional write buffer to the current end of the buffer and updating the accounting information.
As a result, in both cases, only one actual filesystem transaction takes place writing out the complete string. Thus, the actual practical difference between the two methods is that the 'in memory' version copies the memory around many times while the 'disk i/o' one copies the data once (in multiple steps, but each byte sees one copy).
That seems like a bit of a no-brainer, but the point is valid because many programmers may deceive themselves into thinking the 'in memory' method is faster because they don't take the file i/o buffering and the way immutable strings are handled into account.
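
A rough Python sketch of the two approaches described above, assuming bytes objects for the immutable side and io.BytesIO as a stand-in for the buffered file (neither is taken from the paper; the function names are made up):

```python
import io


def build_by_concatenation(n):
    """Immutable route: each step builds a brand-new object."""
    s = b""
    for _ in range(n):
        s = s + b"a"      # allocates a new bytes object and copies the old contents
    return s


def build_by_buffered_writes(n):
    """Buffered route: each step appends at the current end of the buffer."""
    buf = io.BytesIO()    # stand-in for a buffered file; an assumption of this sketch
    for _ in range(n):
        buf.write(b"a")   # copies only the new byte into the buffer
    return buf.getvalue()
```

Both return the same bytes. The difference is only in how many times the earlier data gets copied along the way.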
Re:It depends by lgw · 2015-03-25 05:41 · Score: 5, Insightful

How in the world? Trivially. They're doing it in an O(n^2) way - it's the only explanation.
If you use string concat library code naively, you can end up "copy the string, add one byte, repeat" easily enough in languages like Java. And it's not exactly breakthrough research to discover that O(n) disk can be faster than O(n^2) memory for large enough n.

--
Socialism: a lie told by totalitarians and believed by fools.
Re:It depends by Beat+The+Odds · 2015-03-25 05:43 · Score: 3, Funny

Was their paper peer reviewed?
I believe that it may have been beer reviewed.
Re:It depends by Penguinisto · 2015-03-25 05:43 · Score: 1

Ditto here... and GP has a damned good summary to boot.

--
Quo usque tandem abutere, Nimbus, patientia nostra?
Re:It depends by Ronin+Developer · 2015-03-25 05:45 · Score: 1

All else being equal, I am betting that they coded in a language that:
1) Uses the heap to allocate/reallocate memory (ie. 1 million times).
2) Uses non-mutable strings.
This will be significantly less efficient (i.e., painfully slow) than writing each byte to a buffer in the HD and then committing the buffer to disk.
Snap...just read the article. They used Java and Python...need I say more.
Re:It depends by Ignacio · 2015-03-25 05:47 · Score: 1

That they barely knew what they were doing. But what do Java and Python have to do with it?
Re:It depends by sjames · 2015-03-25 05:51 · Score: 1

It is terrible, but its 'terribleness' may be non-obvious. The paper's actual purpose is to point that trap out so people don't fall into it. Do read it.
Re:It depends by Megol · 2015-03-25 05:59 · Score: 1

I'm not saying these guys didn't goof up in some way, or if they were right, it's just that sometimes the old paradigm of how everyone believes things work is just plain wrong.
I'll give one example from when I was in high school. As any programmers among the readers know, the slowest form of sort is the bubble sort.
Wrong. It is commonly the slowest _real_ sorting algorithm but for some data it is very fast. Sorted or almost sorted data are the best cases for it.

We figured out how to make it faster than all other types of sorts. We kind of freaked when our trick not only worked, but it made it the fastest. We then tested it and worked out an algorithm to keep it at its fastest.
For generic data that is simply impossible.

The old paradigm that bubble sorts are the slowest sorts got destroyed, so it's always going to be possible that old accepted assumptions about how things work can be overturned, even if it requires certain specific parameters. (Of course crap code will F anything, so that's not what we're talking about.)
Either you forgot to list limitations to the problem that make your assumptions above right for that _specific_ problem or you are completely wrong.
Re:It depends by swilver · 2015-03-25 06:00 · Score: 1

So will the os fragment the string, when it comes up to an other systems reserved memory spot. Will it overwrite it (Buffer overflow), will it find a contiguous larger memory block and copy the data there
Wow, this may have been the case eons ago when MMU's didn't exist, but in a modern day OS, you get an address space that's all your own. Nobody else is using it, and real memory is simply mapped (usually in chunks of 4k or 8k) into your address space -- it can be fragmented all over the place, it will still look like one nice big chunk to your program.

OS Disk Management: A lot of the same concerns as memory management. However, for a bunch of small requests it is easier to find free space than for one larger spot, so there may be more seek time.
Most filesystems already work with a minimum of 4k chunks. I cast serious doubts on your claim that smaller chunks are easier to find, it would depend on the data structure used for tracking free space and whether the filesystem is reserving space for you by leaving gaps at your last write location.
Re:It depends by Trailer+Trash · 2015-03-25 06:01 · Score: 3, Insightful

The real story here, is that if you don't know how to write code properly, then string concatenation can be really slow.
Was their paper peer reviewed?
I just reviewed it, but frankly, they're not my peers.
They actually understand the problem and state it near the end of the paper. The issue is pretty simple and when I read the /. summary I knew what the problem was. They're appending single bytes to a string. In both chosen languages - Java and Python - strings are immutable so the "concatenation" is way the hell more complex than simply sticking a byte in a memory location. What it involves is creating a new string object to hold both strings together. So, there's the overhead of object creation, memory copying, etc. Yes, by the time you're done it's a lot of extra work for the CPU.
I'm going to state this as nicely as I can: what they proved is that a complete moron can write code so stupidly that a modern CPU and RAM access can be slowed down to the extent that even disk access is faster. That's it.
Even if you wrote this in C in the style in which they did it the program would be slow. Since there's no way to "extend" a C string, it would require determining the length of the current string (which involves scanning the string for a null byte), malloc'ing a new buffer with one more byte, copying the old string and then adding the new character and new null byte. Scanning and copying are both going to require an operation for each byte (yeah, it could be optimized to take advantage of the computer's word length) on each iteration, with that byte count growing by "1" each time.
The sum of all integers up to N is N(N+1)/2. If N is 1,000,000 the sum is 500,000,500,000. So, counting bytes (looking for null) requires half a trillion operations and copying bytes requires another half trillion operations. Note that "operations" is multiple machine instructions for purposes of this discussion.
Yeah, modern computers are fast, but when you start throwing around a trillion operations it's going to take some time.
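
The arithmetic above is easy to check. A small Python sketch, where total_operations is just a made-up helper name for the closed-form sum:

```python
def total_operations(n):
    """Sum 1 + 2 + ... + n: the cost of one full pass per iteration
    over a string that grows by one byte each time."""
    return n * (n + 1) // 2


n = 1_000_000
scan_ops = total_operations(n)   # looking for the null byte on every iteration
copy_ops = total_operations(n)   # copying the old string on every iteration
print(scan_ops)                  # 500000500000
print(scan_ops + copy_ops)       # 1000001000000
```

So "a trillion operations" is the right order of magnitude for scanning plus copying, before counting how many machine instructions each "operation" really takes.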
Writing to disk will be faster for a number of reasons, mainly because the OS is going to buffer the writes (and know the length of the buffer) and handle it much much better. It's not doing a disk operation every time they do a write. If they were to flush to disk every time they would still be waiting for it to finish.
There are a few notes, here. First, in Java and Python the string object likely holds a "length" value along with the actual character buffer. That would make it faster and not require all the operations the badly written C code that I describe above would require. But the overhead of objects, JVM, interpreter, etc. gets thrown into the mix. Second, if I were doing something like this in C I could keep the string length as part of a struct and at least make it that much faster. The point is that a good programmer wouldn't write code in this manner.
Anyway, this "paper" proves nothing except that really bad code will always suck. One would have to be an idiot to write anything close to what they've done here in a real-life scenario. I know because I've cleaned up other people's code that's on the level of this junk...

--
Do you have ESP?
Re:It depends by DigiShaman · 2015-03-25 06:01 · Score: 1

Starting with Windows 8 / 2012, developers have restricted disk / volume I/O access. I'm not sure how this plays in regards to the testing, but link below nonetheless.
https://msdn.microsoft.com/en-...

--
Life is not for the lazy.
Re:It depends by gnasher719 · 2015-03-25 06:15 · Score: 1

Indeed. Their results make no sense. They are doing something weird. For instance, their paper says that concatenating a million one byte strings into a single million byte string takes 274 seconds. That should take much less than one second. Their code is listed at the end of the paper, and they seem to be assuming that "flush" means the code is actually written to disk. It does not. It just means the bytes were passed to the operating system.
What are the bets that they didn't actually append a byte to a string, but created a new string consisting of an old string with one byte added?

In Objective-C, this would be using NSString instead of NSMutableString. In Java, which they were using, probably using a String instead of a StringBuilder or something similar. A file containing a string is basically a mutable string.

So the headline should be: Doing things on disk can be faster than doing incredibly stupid things in memory.
Re:It depends by Anonymous Coward · 2015-03-25 06:17 · Score: 1

I know, I know, this is slashdot and we don't read no stinking articles around here... But seriously, the only thing that's crappy is your comment, because, not having read the paper, you have no clue what their results are. I did read it and let me tell you, they are correct, properly explained, and relevant to the real world. They won't win any awards for it, because it is pretty simple, though.
Re:It depends by rrr00bb5454 · 2015-03-25 06:18 · Score: 1

The silliness of the paper is that there is no reason at all to keep previously submitted chunks in memory, and it's like somebody discovered that naive string appends are quadratic in memory allocation. On day 2 of everybody's first job, they learn to just append strings to a list and either flatten them to the one big string you need at the end, or evict the head of the list out somewhere (disk?) when a reasonable chunk size (optimize for block size) or amount of time (optimize for latency) has passed. I would imagine that in this case, you should simply queue up writes in memory into a constant-sized and pre-allocated buffer, and flush to disk as soon as it is the size of a disk block.
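
The "append to a list, flatten at the end" habit looks roughly like this in Python. build_with_list is a made-up name and this is only a sketch of the idea, not anything from the paper:

```python
def build_with_list(chunks):
    """Collect the pieces in a list and flatten them once at the end."""
    parts = []
    for chunk in chunks:
        parts.append(chunk)   # stores a reference; earlier chunks are not copied
    return "".join(parts)     # one final pass builds the big string


# Hypothetical usage: a million one-character chunks.
result = build_with_list("a" for _ in range(1_000_000))
```

Each character gets copied once, at the join, instead of once per later append.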
Re:It depends by TemporalBeing · 2015-03-25 06:24 · Score: 2

Even if you wrote this in C in the style in which they did it the program would be slow. Since there's no way to "extend" a C string, it would require determining the length of the current string (which involves scanning the string for a null byte), malloc'ing a new buffer with one more byte, copying the old string and then adding the new character and new null byte. Scanning and copying are both going to require an operation for each byte (yeah, it could be optimized to take advantage of the computer's word length) on each iteration, with that byte count growing by "1" each time.
Actually, you can "extend" a C-style string just fine in C - just replace the NULL byte with another byte. It's a common error in C programs to miss the NULL byte.

This works because C doesn't do boundary checks and will gladly let you overwrite your stack or heap.

Unlike Java, C doesn't try to protect you from yourself.

--
Truth is like the sun. You can shut it out for a time, but it ain't goin' away. - Elvis Presley (source: imdb.com)
Re:It depends by edtice1559 · 2015-03-25 06:25 · Score: 1

I'm pretty sure that, in order to write something to disk, it first has to be written to memory! I don't think there is a function that goes right from a register to disk.
Re: It depends by samkass · 2015-03-25 06:26 · Score: 1

Actually, even beginner Java programmers know to use a StringBuilder for these cases, which allows for constant-time appending. It's a little harder to do "right" in C, where you can accidentally get O(n^2) time by reallocating memory each time, but still not hard. The language here isn't making the difference; it's their algorithm.

--
E pluribus unum
Re:It depends by suutar · 2015-03-25 06:28 · Score: 1

yeah. Looks like they're comparing concatenation (with all the attendant object creation and obsoletion) with the equivalent of using a stringbuffer (that happens to be held by the kernel). Shock, amazement.
Re:It depends by suutar · 2015-03-25 06:31 · Score: 1

This. They've discovered that stringbuffers are faster than repeated string concatenation, is all.
Re:It depends by Trailer+Trash · 2015-03-25 06:35 · Score: 2

Well, yeah, but that's not going to work consistently. Worst case is if the string is on the stack you'll smash the stack and likely have a memory access error. If it's on the heap you'll likely get the error quicker.
I wouldn't even think of writing a program in the manner in which their sample was written, but if I was trying to solve their basic "problem" there are better ways to go about it.

--
Do you have ESP?
Re:It depends by PsiCTO · 2015-03-25 06:36 · Score: 1

"Was their paper peer reviewed?"
I think all the other students were too busy writing exams, so no...
Okay, that's mean. To be fair, one author is a Project Manager. However, another is an Associate Professor who teaches "W2014 - SENG 533 - Software Performance Evaluation"... this is a concern for his students.
Re:It depends by jbengt · 2015-03-25 06:40 · Score: 1

RTFP before you complain, it already addresses your concerns. The point is about assuming that disk write will be slower, when, in real life, some specific programs can be sped up by writing directly to disk. They mention that the OS takes care of disk buffering for you and note a lot of stuff that is happening behind the scenes in memory, especially with immutable strings in high level languages.
Re:It depends by twitnutttt · 2015-03-25 06:43 · Score: 1

The study found: "For example, using Java (on both Windows and Linux) to concatenate 1,000,000 1-byte strings in-memory and doing a single write to disk was 9,000 times slower than simply doing 1,000,000 disk writes. The in-memory approach was faster when the code was written in Python instead of Java, but was still hundreds of times slower than the write-to-disk-only approach when doing many concatenations. As expected, as the number of string concatenations decreased, the in-memory approach got closer and closer to the time required by the disk-only approach."
This does make some sense, as string concatenation in Java is well-known to be an inefficient process. (Hopefully they're using StringBuffer and not plain String!) But the fact that it holds true in Python also, albeit orders of magnitude better, is surprising.
Re:It depends by Anonymous Coward · 2015-03-25 06:46 · Score: 2, Informative

It's exactly this. The Java code they wrote uses String, resulting in an O(n^2) algorithm. A trivial change to StringBuilder would result in an improvement to O(n).
The paper is just embarrassing.
Re:It depends by gnupun · 2015-03-25 06:50 · Score: 4, Informative

There's nothing wrong with Java or Python, but the programmer is inexperienced. Java and Python strings are immutable. So, any time they concatenate a single character to an existing string, the Java runtime creates a brand new string, leaving the original string intact (since it is immutable). So if they create a million character string using a million concatenations, guess what, a million new strings are created and that's very slow. A better solution is to use a mutable String, aka StringBuilder.
But the right solution is to use a small buffer, say 16KB to 100KB in size, fill that with characters and flush that buffer to disk every time it's full. The speed would be same as any other method, but the max memory used is 20x smaller.
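
A minimal sketch of that small-buffer approach in Python. ChunkedWriter and its flushes counter are invented for illustration, and the sink can be any object with a write() method (a file opened in "wb" mode, or io.BytesIO for testing):

```python
import io


class ChunkedWriter:
    """Collect small writes and pass them on in fixed-size chunks."""

    def __init__(self, sink, capacity=16 * 1024):
        self._sink = sink
        self._buf = bytearray()
        self._capacity = capacity
        self.flushes = 0   # number of chunks handed to the sink so far

    def write(self, data):
        self._buf += data
        # Hand over full chunks as soon as enough bytes have piled up.
        while len(self._buf) >= self._capacity:
            self._sink.write(self._buf[:self._capacity])
            del self._buf[:self._capacity]
            self.flushes += 1

    def close(self):
        # Hand over whatever is left, which may be a partial chunk.
        if self._buf:
            self._sink.write(self._buf)
            self._buf.clear()
            self.flushes += 1


# Hypothetical usage with a tiny capacity so the chunking is visible.
sink = io.BytesIO()
writer = ChunkedWriter(sink, capacity=16)
for _ in range(100):
    writer.write(b"a")
writer.close()
```

Memory use stays bounded by roughly the chunk size plus the largest single write, no matter how long the output gets.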
Re:It depends by V-similitude · 2015-03-25 06:50 · Score: 1

Indeed. Relevant text from TFA:

The explanation offered by the authors is that these higher level languages are doing a lot of work behind the scenes to handle the concatenation, such as creating new objects and copying the strings in order to accommodate the extra bytes of data. “The above explanation applies to any data structure that has to be stored contiguously and increases in size, or is immutable,” they wrote. Conversely, the disk-access approach was faster because the operating systems handled the writes efficiently via buffering and only actually wrote to disk when necessary.
They're trying to point out exactly what everyone here is trying to say they're missing. Not really sure it warrants a research paper, but yeah, common sense if you've ever studied computer science at all.
Re:It depends by Anonymous Coward · 2015-03-25 06:51 · Score: 5, Insightful

And they're using BufferedWriter to write to the file which, as the name suggests, is buffering the data *in memory* before writing it.
So the result of the paper is actually O(n) in memory algorithm outperforms O(n^2) in memory algorithm for data sizes of 1MB. Hardly surprising.
Re:It depends by gbjbaanb · 2015-03-25 06:57 · Score: 1

I think this shows the education of modern programmers.
Take a string, append 1 byte. Repeat a million times. Say "why is it so slow?".
It's probably because every time you write to most strings classes, you're making a copy and re-allocating the whole lot, and then deallocating the original.
If you knew C, you'd know what was happening here. This is why we need to teach C to programming students and not Java. Once they know C they can learn Java or whatever takes their fancy on their own time.
(although even Java and .NET programmers should understand what a stringbuilder is and why you'd use it)
Re:It depends by AlejoHausner · 2015-03-25 07:08 · Score: 2

OK, so the authors are bad programmers and don't understand how string concatenation works. Strings are contiguous arrays, whereas disk files are made up of consecutive blocks, which are accessed through an index. If you want to append to a file, you may add a block, and modify the end of the index. But if you want to append to an array, you are forced to allocate a whole fresh array, because strings use fixed-size arrays.
On the other hand, Java StringBuffers have amortized O(1) append cost: a StringBuffer occasionally re-allocates itself to a larger piece of memory, but averaged over many appends the cost of each one is O(1).
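
To see why occasional re-allocation still averages out to O(1) per append, here is a toy growable buffer in Python. The doubling factor and the name GrowableBuffer are assumptions of this sketch, not a description of how Java's StringBuffer is actually implemented:

```python
class GrowableBuffer:
    """Toy buffer that doubles its capacity whenever it fills up."""

    def __init__(self):
        self._data = bytearray(1)   # backing store with capacity 1
        self._size = 0
        self.bytes_moved = 0        # bytes copied during re-allocations

    def append(self, byte):
        if self._size == len(self._data):
            # Full: allocate twice the space and move the old contents over.
            bigger = bytearray(2 * len(self._data))
            bigger[:self._size] = self._data[:self._size]
            self.bytes_moved += self._size
            self._data = bigger
        self._data[self._size] = byte
        self._size += 1

    def value(self):
        return bytes(self._data[:self._size])


# Hypothetical usage: 1000 appends of the byte for "a".
buf = GrowableBuffer()
for _ in range(1000):
    buf.append(ord("a"))
```

After 1000 appends the buffer has moved 1 + 2 + 4 + ... + 512 = 1023 bytes in total, about one extra copy per byte, where the copy-on-every-append approach would have moved roughly half a million.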
Re:It depends by ndykman · 2015-03-25 07:23 · Score: 1

It doesn't warrant a paper. This is basic algorithms. They deserve all the ridicule they get, and I hope it is a lot.
Re:It depends by BarbaraHudson · 2015-03-25 07:31 · Score: 1

Anyone with any brains would have pre-allocated an array, then written into it at the appropriate offsets. These people are idiots.

--
"Transparent" is a shit show that trades on every stereotype going. A man in drag is NOT a transsexual.
Re:It depends by Darinbob · 2015-03-25 07:41 · Score: 1

The actual report had other factors involved. Ie, it kept growing a string using a high level language like Java or Python. Thus the underlying runtime system would have to keep growing the string, freeing and reallocating memory, exacerbating garbage collection frequency, etc. Meanwhile the file system is happily buffering all the writes. My guess is that on a similar system using C that the smart programmer isn't going to see these effects, either by relying on the file system to do buffering or else use a fixed size buffer.
Re:It depends by Darinbob · 2015-03-25 07:48 · Score: 1

Yes, the code is goofy. They're in a very high level language. If they concatenate a million one byte strings one at a time, then there's one million strings that have been created. Ie, reallocate the memory constantly, keep the garbage collector busy, and so forth. Ie, the effort in trying to get the entire contents in memory being built up incrementally was slower than the actual I/O time. Ie, 4 minutes versus a fraction of a second. Meanwhile if they had just written one byte at a time to the file system then both Windows and Linux would happily buffer up that data and write it out much more optimally.
Re:It depends by Arancaytar · 2015-03-25 08:02 · Score: 1

For instance, their paper says that concatenating a million one byte strings into a single million byte string takes 274 seconds. That should take much less than one second
wtf. The only way it could take that long would be if they were concatenating them as immutable strings and had to copy the result repeatedly.
Re:It depends by ceoyoyo · 2015-03-25 08:32 · Score: 1

They're not doing something weird, the article is crazy.
Basically, they wrote some shitty code to do highly inefficient string concatenation and, wow, it turns out that it's less efficient than the caching code in the operating system. They're not comparing in-memory versus disk operations at all.
Re:It depends by ceoyoyo · 2015-03-25 08:39 · Score: 3, Insightful

One of them looks like a chemical engineering PhD student and the other is a tech, so maybe not. The third is an electrical engineering professor who's supposed to be doing software performance research though. He should definitely know better.
Although, when I was at the U of C the people doing software stuff in the EE department had some very interesting ways of doing things.
Re:It depends by ceoyoyo · 2015-03-25 08:47 · Score: 1

The Python code
s = ''
for i in range(0, 1000000):
    s += str(i)
will be painfully slow too, for multiple reasons. Both Java and Python have ways to do this that aren't dumb.
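
One of the less dumb ways, sketched with io.StringIO from the standard library. The function names are made up, and how slow the naive loop really is depends on the interpreter, since CPython can sometimes resize a string in place:

```python
import io


def digits_naive(n):
    """The loop from above: repeated += on an immutable string."""
    s = ''
    for i in range(n):
        s += str(i)
    return s


def digits_stringio(n):
    """Same output, built in an in-memory text buffer instead."""
    out = io.StringIO()
    for i in range(n):
        out.write(str(i))     # appended to the buffer; no new big string per step
    return out.getvalue()     # the full string is materialized once, here
```

Both return the same string; only the second avoids depending on the interpreter being clever about +=.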
Re:It depends by pjt33 · 2015-03-25 08:50 · Score: 1

It is bad, but I've seen worse.
Re:It depends by Anonymous Coward · 2015-03-25 09:15 · Score: 1

It doesn't warrant a paper.
Why? If you were trying to argue it is not appropriate for a particular journal, that would be one thing. But this is just thrown up on arXiv, which includes papers varying from blog-level commentary to course notes to student projects. It is just a short paper explicitly stating that stupid or contrived coding examples can break rules of thumb, and that it is important to know something about underlying mechanisms.
Re:It depends by Pseudonym · 2015-03-25 09:22 · Score: 2

But the right solution is to use a small buffer, say 16KB to 100KB in size, fill that with characters and flush that buffer to disk every time it's full.
Which is to say, do what every programming language with buffered I/O does.
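A toy sketch of that small-buffer idea, for anyone who has not looked inside a buffered writer. The class name, the `sink` parameter and the `flushes` counter are all hypothetical; real runtimes (Python's io.BufferedWriter, Java's BufferedWriter) already do this for you, so this is only meant to show the mechanism.

```python
import io


class SmallBufferWriter:
    """Toy writer: fill a fixed-size buffer, pass it on whenever it is full."""

    def __init__(self, sink, capacity=16 * 1024):
        self.sink = sink                # any object with a write(bytes) method
        self.buf = bytearray(capacity)  # allocated once and reused
        self.used = 0                   # number of valid bytes in buf
        self.flushes = 0                # how many times we wrote to the sink

    def write(self, data):
        view = memoryview(data)
        while len(view) > 0:
            room = len(self.buf) - self.used
            if room == 0:
                self.flush()
                room = len(self.buf)
            chunk = view[:room]
            # Copy the chunk into the free part of the buffer.
            self.buf[self.used:self.used + len(chunk)] = chunk
            self.used += len(chunk)
            view = view[len(chunk):]

    def flush(self):
        # Hand the filled part of the buffer to the sink and start over.
        if self.used:
            self.sink.write(bytes(self.buf[:self.used]))
            self.flushes += 1
            self.used = 0
```

With a 16KB buffer, a million one-byte writes turn into a few dozen writes to the sink, and no byte is ever re-copied because the string got longer.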

--
sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
Re:It depends by Pseudonym · 2015-03-25 09:27 · Score: 1

Also relevant text from TFA:

In this paper we use code inspired by real, production software [...]
Sadly, that's probably 100% accurate.

--
sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
Re:It depends by Atzanteol · 2015-03-25 09:30 · Score: 1

They did do it with StringBuilder also - and showed a large improvement. It's like they read your mind!

--
"Ignorance more frequently begets confidence than does knowledge"

- Charles Darwin
Re:It depends by Atzanteol · 2015-03-25 09:31 · Score: 1

Didn't read the paper eh?

--
"Ignorance more frequently begets confidence than does knowledge"

- Charles Darwin
Re:It depends by Atzanteol · 2015-03-25 09:43 · Score: 1

In the Java application they're using a BufferedWriter as well - so they're buffering before the OS buffer.
It seems pretty clear to me that "concatenating a string then writing it to a buffer and flushing that to disk" would be faster than "writing a bunch of strings to a buffer then flush that to disk." They're basically copying that data around at least twice.

--
"Ignorance more frequently begets confidence than does knowledge"

- Charles Darwin
Re:It depends by chuckugly · 2015-03-25 09:58 · Score: 1

Actually the way they did it, when you write to a file you add a byte to a memory buffer and the runtime and/or OS and/or disk controller will then write that across the IO bus to cache in the disk, where it will finally get carved into the platter. I believe.

What they are doing is exploring two different ways to concatenate in memory, one using Java and one using lower level code written in (probably) C.
Re:It depends by Atzanteol · 2015-03-25 10:07 · Score: 1

Which makes it very strange that they would think to write a paper on it. It's not even worth a blog post.
Not to mention that in the Java implementation they're writing to a BufferedWriter. So even with the StringBuilder they're comparing "concatenating a string, writing it to a buffer and writing that to disk" to "copying strings to a buffer then writing that to disk."
If you do more work it takes longer. QED.

--
"Ignorance more frequently begets confidence than does knowledge"

- Charles Darwin
Re:It depends by Quatermass · 2015-03-25 10:15 · Score: 1

There will be a mass of garbage collection....Slow, very slow.

--
Stuart http://stuarthalliday.com/
Re:It depends by 31eq · 2015-03-25 10:22 · Score: 1

The simple way to do it in Python is

>>> import time
>>> import random
>>> chars = [chr(random.randint(0,255)) for each in range(1000000)]
>>> stamp=time.time(); joined=''.join(chars); time.time()-stamp
0.11153912544250488

So that's 1/9 of a second on a not very fast laptop. You can try it yourself. Obviously not as fast as C++ but not horrible.
Re:It depends by WolfWithoutAClause · 2015-03-25 11:29 · Score: 1

Yeah, and if you think about it, the OS's disk buffer is an array and it's already allocated. ;)

--
-WolfWithoutAClause
"Gravity is only a theory, not a fact!"
Re:It depends by bingoUV · 2015-03-25 11:33 · Score: 1

The above explanation applies to any data structure that has to be stored contiguously and increases in size, or is immutable
But it was never necessary for the "in memory operation" (their words) to use an immutable data structure. If you use bad data structure when using in-memory, of course in-memory will be slower than disk.

--
Bingo Dictionary - Pragmatist, n. A myopic idealist.
Re:It depends by NoOneInParticular · 2015-03-25 11:37 · Score: 1

They're testing the common assumption that to do anything fast, you have to do it in memory, regardless of language etc. They take production code to test this. Their work falsifies the claim that memory operations are necessarily faster. The abstract is pretty clear about this, not sure where you got confused.
Re:It depends by DeKO · 2015-03-25 11:39 · Score: 1

Replying to undo accidental moderation.
Also, the Python code is a prepend (concatString = addString + concatString), which forces a quadratic algorithm when done in memory; that is completely different from the direct-to-file append. When they changed the order to compare append vs append, in-memory-then-file was faster.
Re:It depends by bingoUV · 2015-03-25 11:40 · Score: 2

But except saying it "dramatically" improves results, the StringBuilder result wasn't worthy of a mention or a compare against disk performance.
Obviously, like any good "researcher" does, the conclusion was written first and then the "experiment" was performed. Any results contradicting the conclusion have to be excluded.

--
Bingo Dictionary - Pragmatist, n. A myopic idealist.
Re:It depends by ILongForDarkness · 2015-03-25 12:02 · Score: 2

Yeah I'd say bad implementation. They could have some performance improvement depending on timings and such though. Messing around in memory plus one fairly large write (only 1MB, so not really, but let's say so for argument's sake) vs many smaller writes: depending on how the OS handles the write requests, you might end up hitting the disk cache and then doing work while the disk is busy spinning and actually writing out your changes. With one big write you might end up hitting some limit that makes the thing not fully buffered in cache and have to wait for the disk to actually complete the write.
Re:It depends by gnupun · 2015-03-25 12:09 · Score: 1

The point is about assuming that disk write will be slower, when, in real life, some specific programs can be sped up by writing directly to disk.

No, you can't speed up something by using something that is at least 10,000 times slower than the alternative.
// Second part: disk-only
try {
    writer = new BufferedWriter( new FileWriter("test.txt"));
    startTime = System.currentTimeMillis();
    for (int i=0; i < numIter; i++) {
        writer.write(addString);
    }
    writer.flush();
    writer.close();
    endTime = System.currentTimeMillis();
Their so-called code that "directly writes to disk" in fact writes several KB to memory (using a BufferedWriter object) and then writes that memory to disk at several intervals. So their argument that disk-based code can be faster than memory-based code is completely false in this case.
This is what the javadocs for BufferedWriter state:

Writes text to a character-output stream, buffering characters so as to provide for the efficient writing of single characters, arrays, and strings.
The buffer size may be specified, or the default size may be accepted. The default is large enough for most purposes.
Re:It depends by sjames · 2015-03-25 14:47 · Score: 1

Quite possibly. Sadly, I'm not so sure about the number of people who studied CS who find this obvious. A lot of people see a bunch of f.write and think I/O, must be a faster way.
Re:It depends by Arkan · 2015-03-25 18:43 · Score: 1

Exactly this! And for the java part at least, they should use NIO channels, which are designed to be closer to the system.
What they determine, in fact, is that their coding knowledge is sub-par. Not to be unexpected from people in biological sciences!
The real issue is: how come the Slashdot editor didn't see this as soon as the story was submitted and put it back where it belongs: /dev/null?
Re:It depends by Carewolf · 2015-03-26 00:13 · Score: 1

Even if you wrote this in C in the style in which they did it the program would be slow. Since there's no way to "extend" a C string, it would require determining the length of the current string (which involves scanning the string for a null byte), malloc'ing a new buffer with one more byte,
There is. It is called realloc. If you are unlucky, it will just divide the number of times the system actually performs by 16 or whatever the malloc implementation uses as an alignment, but once the allocation gets big enough you get pages directly from the system, and it just maps in more pages on the end.
Re:It depends by Trailer+Trash · 2015-03-26 00:36 · Score: 1

Even if you wrote this in C in the style in which they did it the program would be slow. Since there's no way to "extend" a C string, it would require determining the length of the current string (which involves scanning the string for a null byte), malloc'ing a new buffer with one more byte,
There is. It is called realloc. If you are unlucky, it will just divide the number of times the system actually performs by 16 or whatever the malloc implementation uses as an alignment, but once the allocation gets big enough you get pages directly from the system, and it just maps in more pages on the end.
malloc isn't the problem, though. My point was that if you write it in the style of the code in the paper (don't keep track of the string length between character appends) then it'll still have to scan the string a million times. If you know ahead of time that you're going to append exactly one million characters to the string then you need but one malloc, right? I can make this program extremely fast in that manner but that's not what they're doing.

--
Do you have ESP?
Re:It depends by cinky · 2015-03-26 01:57 · Score: 1

They are shitty devs, that's my conclusion...
Re:It depends by rioki · 2015-03-26 02:21 · Score: 1

This is only an iteration of sibling posts, but they are using the most retarded solution to build a string in memory. Something that you learn in your introductory course, use StringBuilder, because performance...
Re:It depends by rioki · 2015-03-26 02:32 · Score: 1

Actually I think with modern OS and compiler the opposite is true. The moment you overwrite your stack canaries and return address your app goes *poof*. (No message box, no error handler, just disappears from the process list.) You can live with corrupted heap objects for a good while; especially if you wrote over the end and don't try to free / reallocate the following heap object.
Re:It depends by sjames · 2015-03-26 05:35 · Score: 1

Because StringBuilder works so well in Python?
Re:It depends by TemporalBeing · 2015-03-26 06:13 · Score: 1

If that's your idea of "extending a string" then perhaps you should be using a language which protects us from you, er, I mean you from yourself.
It was meant as a counter to the GP saying that it was impossible to "extend" a string in C.

Not saying it's the correct way to do it, just that there are possibilities that the GP did not even consider, probably b/c they were taught to program using a language that protects them too much.

--
Truth is like the sun. You can shut it out for a time, but it ain't goin' away. - Elvis Presley (source: imdb.com)
Re:It depends by TemporalBeing · 2015-03-26 06:20 · Score: 1

Well, yeah, but that's not going to work consistently. Worst case is if the string is on the stack you'll smash the stack and likely have a memory access error. If it's on the heap you'll likely get the error quicker.
I wouldn't even think of writing a program in the manner in which their sample was written, but if I was trying to solve their basic "problem" there are better ways to go about it.
That depends on your program, and how much memory was allocated and when it would get detected. The OS is not going to detect anything until you try to leave the bounds of the program itself. Take the following function for instance:

void runOverBuffer(void)
{
    char buffer[10];                 // 10 bytes
    char buffer2[1*1024*1024*1024];  // 1 GB
    ...
}

You can extend buffer into buffer2 without any detections going off, or even any ill-effects until you surpass buffer2 and all the other variables in the function.

Heap allocated functions are a little more tricky but even then you can produce the same kind of behavior if you really wanted to - even with the HEAP randomization, which really doesn't protect the program internally, it only protects the program from the libraries the program uses by randomizing where they are loaded.

And since you control the program, you can control the optimizations so that the only ones that would mess you up - by re-arranging variables - are not run.

As I pointed out elsewhere, the point is not that it's the right way to do it. It's that it is possible to do in C, just as possible as in Assembly.

--
Truth is like the sun. You can shut it out for a time, but it ain't goin' away. - Elvis Presley (source: imdb.com)
Re:It depends by unrtst · 2015-03-26 06:29 · Score: 1

Snap...just read the article. They used Java and Python...need I say more.
Yes. You and everyone else I've read thus far should be saying just a little bit more.
Their python example is doing this for in memory concatenation:
for i in range(0, numIter):
    concatString = addString + concatString
... but this for the disk-only version:
for i in range(0, numIter):
    f.write(addString)
Those are not the same. Try doing the same with the disk-only version (PRE-pending the value).
They even mention it and provide the corrected code for the in-memory version:
for i in range(0, numIter):
    concatString = concatString + addString
When using that, it performs virtually the same as the disk-only solution.
Granted, your point still stands, and they note that in their paper (it's pretty much the whole point of the paper). I just think comparing two completely different things is pretty stupid and worth noting.
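If anyone wants to try the two orders side by side, here is a minimal sketch. The function names and the `timed` helper are made up for illustration; the two loop bodies are the ones quoted above. With a constant addString both loops produce the identical string, so any difference in run time is purely the cost of how it was built. As far as I know, CPython can often grow the left-hand string in place when appending, which would explain why the append order does so much better, but treat that as an assumption about one interpreter, not a language guarantee.

```python
import time


def prepend_build(add_string, num_iter):
    # Order used in the in-memory test: new piece goes in front.
    concat = ''
    for _ in range(num_iter):
        concat = add_string + concat
    return concat


def append_build(add_string, num_iter):
    # Reversed order: new piece goes at the end, like f.write() does.
    concat = ''
    for _ in range(num_iter):
        concat = concat + add_string
    return concat


def timed(fn, *args):
    # Returns (result, elapsed seconds) for a single call.
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == '__main__':
    for fn in (prepend_build, append_build):
        _, seconds = timed(fn, '1', 20000)
        print(fn.__name__, round(seconds, 4), 'seconds')
```

Timings will vary by machine and interpreter version, so no numbers are claimed here.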
Re:It depends by nctritech · 2015-03-26 13:14 · Score: 1

It's not a programming problem, it's a programmer problem.
Re:It depends by jeremyp · 2015-03-27 01:21 · Score: 1

You didn't read their paper properly. They make exactly the point that you are making. i.e. that "writing to disk", in most cases, does not mean physically writing individual bytes to the disk. The abstractions provided by both the language and the operating system help to make the obvious implementation as fast or faster than naive programmer created optimisations. In other words, this is a confirmation of the saying

premature optimisation is the root of all evil
There is a WTF in the paper and it is their claim that Python doesn't run in a VM.

--
All I want is a secure system where it's easy to do anything I want. Is that too much to ask ~~ Randall Munroe
Re:It depends by Eunuchswear · 2015-03-29 18:17 · Score: 1

No, in the case of the "paper" cited it depends on the gross stupidity of the algorithm.
They're comparing:

string s = ""; for (1..1000000) { s = s + "x"; } write (s);

With

for (1..1000000) { write ("x"); }

For fucks sake!

--
Watch this Heartland Institute video
Re:It depends by Eunuchswear · 2015-03-29 18:24 · Score: 1

They prove that string concatenation (in memory) is slower than writing to buffers (in memory).
I.E. that :

char *s, *new; new = malloc (len + 1); memcpy (new, s, len); new [len++] = 'x'; free (s); s = new;

is slower than:

buf [len] = 'x'; if (++len >= sizeof buf) { write (out, buf, len); len = 0; }

What a fucking joke.

--
Watch this Heartland Institute video

The new antipattern by ubergeek65536 · 2015-03-25 03:48 · Score: 3, Funny

Sorry but you'll need to do it without using any memory. We need to make it fast.

Re:The new antipattern by Anonymous Coward · 2015-03-25 04:42 · Score: 5, Insightful

Sorry but you'll need to do it without using any memory. We need to make it fast.
Memory bandwidth is about 20Gb/s. Disk bandwidth is about 0.05Gb/s. The performance consequences of this are obvious to anyone who knows how basic arithmetic works.
The results they got are invalid because their test framework is broken. This is exactly why everyone should be forced to learn C/C++ or Assembler in college/university. The reason for the crap result is they did not preallocate their buffers so they wasted all their execution time allocating and reallocating larger buffers from the heap. The disk APIs have their own internal buffer implementations, that were not written by idiots, that manage this correctly which is the cause of the difference.
Re:The new antipattern by Anonymous Coward · 2015-03-25 05:43 · Score: 1

Sorry but you'll need to do it without using any memory. We need to make it fast.
Memory bandwidth is about 20Gb/s. Disk bandwidth is about 0.05Gb/s. The performance consequences of this are obvious to anyone who knows how basic arithmetic works.
Let me convert this real quick...0.05Gb/s is either 50MB/s, if you wrote it incorrectly, or 6.25MB/s in the case you really did mean little b.
Are you trapped in the previous millennium? Most USB2 devices can write at >6MB/s and USB3 devices write at almost 50MB/s. Good SSDs nowadays will write at 200-400MB/s+ and spinning drives will hit 100MB/s. Current high speed DDR3 hits a ceiling around 17GB/s. All of those numbers are big B, so multiply by 8 (if you know basic arithmetic) for little b values.
http://blog.laptopmag.com/whats-the-best-ssd-5-drives-tested/2
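The conversion above, spelled out as a small sketch. It assumes decimal units (1 Gb = 1000 Mb, 1 GB = 1000 MB) and 8 bits per byte; the function names are just for illustration.

```python
def gigabits_to_megabytes(gigabits):
    # Little b: gigabits -> megabits (x1000) -> megabytes (/8).
    return gigabits * 1000 / 8


def gigabytes_to_megabytes(gigabytes):
    # Big B: gigabytes -> megabytes (x1000).
    return gigabytes * 1000


if __name__ == '__main__':
    print(gigabits_to_megabytes(0.05), 'MB/s if 0.05Gb/s really meant bits')
    print(gigabytes_to_megabytes(0.05), 'MB/s if it was meant to be bytes')
```

So the quoted disk figure is either 6.25MB/s or 50MB/s depending on which b was intended, which is the whole complaint.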
Re:The new antipattern by tricorn · 2015-03-25 12:18 · Score: 1

I wouldn't say the results are invalid, but the relevance is restricted to people who don't understand algorithms or statements such as "disk is slower than memory".
I once had to fix a program that was reading all the file names in a directory into a linked list, sorting it (using operations to retrieve, remove, and insert elements using an index, which worked by starting at the beginning of each list and counting elements until it got to the correct one), then using the resulting sorted list to process the first 10 files.
Rather than fix the abominably slow sort, I used the fact that all the file names were decimal numbers, and all the numbers were sequential, to scan the directory for the smallest number, then just increment that to find the next one. Needless to say, it was both much faster and used very little memory.
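A rough sketch of that shortcut, under the same assumptions as the story: every file name is a decimal number without zero padding, and the numbers are sequential with no gaps. The function name is hypothetical, and in real use the names would come from something like os.listdir.

```python
def first_files(names, count=10):
    # One pass to find the smallest number; no sorting, no linked list.
    lowest = min(int(name) for name in names)
    # Because the numbers are assumed sequential, the next names can be
    # generated by counting up instead of searching for them.
    wanted = min(count, len(names))
    return [str(lowest + offset) for offset in range(wanted)]
```

That is one O(n) scan plus ten increments, and the only thing held in memory besides the listing is a single integer.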
Algorithms matter, and the shame of ever faster processors and "more productive" languages is that too many programmers don't understand them.

Check their work or check the summary? by s.petry · 2015-03-25 03:51 · Score: 4, Insightful

I'll have to dig through their testing and methods, but this seems pretty fishy given the summary.

Seek/Read/Write time of a disk is always slower than memory. No exceptions to the rule exist given current commodity hardware. Bus length to a disk is also much longer than to memory. Again, there are no exceptions given commodity hardware.

Won't be the first time someone reported that the laws of physics don't exist for something, and I'm sure it won't be the last. Maybe someone with free mornings in the US can break it down better than the summary.

--

-The wise argue that there are few absolutes, the fool argues that there are no probabilities.

Re:Check their work or check the summary? by LordLimecat · 2015-03-25 03:54 · Score: 4, Interesting

Tl; DR:
They used python and java. Sort of hard to develop a meaningful thesis on general programming when you're that far up the abstraction stack. Who knows, maybe python and Java suck at memory management (GASP).
Re:Check their work or check the summary? by s.petry · 2015-03-25 03:56 · Score: 2

Not to Karma whore, but I already see two problems with their testing by reading their code samples. Let's see who else finds them. The simple answer is no, disk is not slower than memory. The long answer is yes, programmers can make it look that way.

--
-The wise argue that there are few absolutes, the fool argues that there are no probabilities.
Re:Check their work or check the summary? by s.petry · 2015-03-25 03:56 · Score: 2, Insightful

read their code, you will see the problems.

--
-The wise argue that there are few absolutes, the fool argues that there are no probabilities.
Re:Check their work or check the summary? by MightyYar · 2015-03-25 03:59 · Score: 1

The worst is that they DO know, and yet still wrote a whole paper on it. They are concatenating strings in Java and Python - which is slow. Surprise, surprise, it is faster to write strings to a pre-allocated buffer, in this case the disk cache. That is the whole paper.

--
W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.
Re:Check their work or check the summary? by topology · 2015-03-25 04:01 · Score: 1

More than likely it's employing concurrency between computation and buffered writes to disc. This is really just a special case, a specific exception to a good principle.

You have 2 processes. The first process is a bit-generator which generates a bit at a time for as long as needed. The second process is a buffered writer to disc.

All I got out of the summary was that it is faster to stream the output of the bit generator to the buffered disc writer than to collect all the output of the bit-generator to an in memory storage and then stream the in memory storage to the buffered disc writer. WELL DUH. No mystery there. Stream processing 101.
Re:Check their work or check the summary? by TheCarp · 2015-03-25 04:05 · Score: 1

Except that frictionless spherical cows are not realistic even if they are very helpful in physics.
When is the last time you actually talked to raw hardware? If it's recent, you are a special case, and likely write drivers....in which case, good for you.
When you write "to disk" you are working in memory because it's going to be a buffered access, likely reads as well, especially if it is something you recently wrote.
Exceptions will exist but, they are exceptions to the rule.

--
"I opened my eyes, and everything went dark again"
Re:Check their work or check the summary? by Frnknstn · 2015-03-25 04:06 · Score: 5, Interesting

It's not even the choice of tools, they seem to willfully misuse the languages to get poor results.

--
If it's in you sig, it's in your post.
Re:Check their work or check the summary? by Anonymous Coward · 2015-03-25 04:11 · Score: 2, Informative

You don't even have to read the code. Reading what languages they used reveals the entire flaw. They used languages with expensive string operations when done in-memory which is the only reason why writing to a buffered cache and writing to disk is faster.
Re:Check their work or check the summary? by wonkey_monkey · 2015-03-25 04:11 · Score: 1

That is pretty much what the article suggests. Concatenating strings involves creating objects, blah blah blah...
I doubt you'd see the same "9000 times slower!" kind of results with standard C strings.

--
systemd is Roko's Basilisk.
Re:Check their work or check the summary? by gstoddart · 2015-03-25 04:11 · Score: 1

Somewhat off-topic, but somewhat related:
Many years ago, when I was doing my degree and computers were still steam powered, a friend and I were writing the same assignment.
He worked for the university, and had a privileged account on the VAX. I had the loan of a 286 from a prof who no longer needed it and took pity on me.
I, being constrained by physical memory, had to write a new kind of sparse array to hold my data. He, having access to lots more virtual memory heap than I, wrote a huge array which wasn't sparse to just brute force it, even though most of the array was useless.
At the end of the day, we both got the same results from our programs. The difference was, I got an A+, because the prof decided he'd steal my sparse array for his own research. My friend got an A, because he got the right outcome, but used a slightly less elegant solution.
Optimizing memory is a dying skill, and in the case of most high level languages, there's too many layers between you and the hardware to know what is actually happening.
But some of us still think back nostalgically to when having a 1MB string was simply not possible, and when we used to have to use our own voodoo to cram into the small amounts of memory we did have. :-P
Now get off my damned lawn with your big fancy strings. ;-)

--
Lost at C:>. Found at C.
Re:Check their work or check the summary? by squiggleslash · 2015-03-25 04:14 · Score: 1

No, it's completely understandable and shouldn't even be thought of as strange to seasoned programmers.
The critical issue is there's a difference between calling an I/O function like write, and actually manipulating the IDE control lines on a hard disk. Typically for the former, the operating system is sitting there buffering things up in a relatively simple, uncomplex, way - ie it has some memory allocated, a pointer, and when you call the function all it does is copy the bytes to the memory and increment the pointer as needed. Once either enough time has passed, a critical function has been called, or enough data has been written, the OS then starts manipulating the IDE control lines to write the data.
Now, the comparison becomes "the OS's buffer handling" vs "Your language of choice's string handling and garbage collection algorithms." For C, chances are you're as good as the OS as C's string handling is extremely uncomplicated and bare metal. For almost anything else - such as Python and Java, both tested in this scenario - you're likely to end up with the OS handling some situations more quickly than your language would.
Does it make sense now? It should. There are very few programmers this should surprise. Unfortunately, I know quite a few that will be...

--
You are not alone. This is not normal. None of this is normal.
Re:Check their work or check the summary? by Anonymous Coward · 2015-03-25 04:16 · Score: 5, Insightful

Let me guess
1. They used "" + "" instead of StringBuilder
2. They didn't actually flush the file bytes to disk, so it's really a comparison of stupid programmer in-memory string cat and intelligent caching of file writes.
3. They intentionally engineered a scenario that reported data that was contrary to reality in order to get clicks
Re:Check their work or check the summary? by s.petry · 2015-03-25 04:16 · Score: 1

Bah, I wrote the wrong thing...

The simple answer is no, disk is not faster than memory.

--
-The wise argue that there are few absolutes, the fool argues that there are no probabilities.
Re:Check their work or check the summary? by Lunix+Nutcase · 2015-03-25 04:16 · Score: 1

Why bother? Their entire "research" is bogus. Everyone knows that buffered writes are going to be faster than doing byte-by-byte writes.
Re:Check their work or check the summary? by Lunix+Nutcase · 2015-03-25 04:17 · Score: 1

Pretty much. This entire article is basically saying that if you do things in the most stupid way possible you can make it slow.
Re:Check their work or check the summary? by s.petry · 2015-03-25 04:18 · Score: 2

Yup yup, I wrote something contrary to my original post quite unintentionally. Memory is always faster than Disk, unless you write shit tests that behave abnormally to make a bogus claim.

--
-The wise argue that there are few absolutes, the fool argues that there are no probabilities.
Re:Check their work or check the summary? by greg1104 · 2015-03-25 04:22 · Score: 1

It's all operating system cached writes, they're not even getting to the disk's write cache.
Python's file flush() function does not flush data to disk. You have to call os.fsync(f.fileno()) for that.
Same problem with the Java code. flush doesn't make sure data is on disk. You have to use sync or force or something.
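For the Python side, a minimal sketch of the difference. The function name is made up; f.flush() and os.fsync() are the standard library calls mentioned above. Note that even fsync only asks the OS to push the data to the device; what the drive's own cache does after that is outside Python's control.

```python
import os
import tempfile


def write_durably(path, data):
    """Write data and ask the OS to push it out to the storage device."""
    with open(path, 'wb') as f:
        f.write(data)         # lands in Python's own user-space buffer
        f.flush()             # hands the bytes to the operating system
        os.fsync(f.fileno())  # asks the OS to write them to the device


if __name__ == '__main__':
    target = os.path.join(tempfile.mkdtemp(), 'test.txt')
    write_durably(target, b'1' * 1000000)
    print(os.path.getsize(target), 'bytes written')
```

A benchmark that stops the clock after flush() but never calls fsync() is timing a copy into the OS cache, not a disk write.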
This is an excellent way to introduce the smart scientist/moron coder archetype to people though, so it's not completely useless.
Re:Check their work or check the summary? by bondsbw · 2015-03-25 04:23 · Score: 5, Informative

Specifically, the time measured to write to memory uses the following code:
for (int i=0; i < numIter; i++) { concatString += addString; }
The time measured to write to disk uses the following code:
for (int i=0; i < numIter; i++) { writer.write(addString); } writer.flush(); writer.close();
In Java, strings are immutable. Each string concatenation produces a new string on the heap, and the old string is unchanged. So there are numIter strings created in memory, and I assume garbage collection will probably happen at some point once enough memory is used. O(n) reads and O(n) writes to the heap with O(n^2) memory usage plus an unknown number of garbage collections. This can cause considerable slowing of the in-memory algorithm.
That algorithm is then compared with one that does numIter writes to a buffer, which is then flushed to disk at the end. O(n) writes to memory buffer (no need to re-read memory) using O(n) memory space, followed by O(1) writes to disk and O(n) disk space used.
Granted, it's been over a decade since I took algorithms so I wouldn't doubt that someone can show how I am off, but this kind of thing should be simple to spot for anyone who has an undergrad CS degree.
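Taking up that invitation, a back-of-the-envelope sketch of how much copying each approach does. It assumes every concatenation copies the whole new string (no in-place tricks by the runtime) and ignores allocator and garbage collection costs; the function names are hypothetical.

```python
def chars_copied_naive(num_iter, piece_len=1):
    # Iteration i builds a brand-new string of length i * piece_len,
    # so the total is piece_len * (1 + 2 + ... + num_iter).
    return sum(i * piece_len for i in range(1, num_iter + 1))


def chars_copied_buffered(num_iter, piece_len=1):
    # Each piece is copied into the buffer exactly once.
    return num_iter * piece_len
```

For a million one-character pieces that is about 500 billion characters copied versus one million, i.e. O(n^2) work against O(n). Under this model only the latest string needs to be alive at any moment, so the quadratic term shows up in copying (and in garbage produced), not necessarily in peak memory.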
PS - I love how the paper makes this aside as if it doesn't matter tremendously:

Java performance numbers did not change when the concatenation order was reversed in the code in Appendix 1. However, using a mutable data type such as StringBuilder or StringBuffer dramatically improved the results.

--
All my liberal friends think I'm a conservative, all my conservative friends think I'm a liberal.
Re:Check their work or check the summary? by aethelrick · 2015-03-25 04:27 · Score: 1

This was of course compounded by the fact that they did not follow the languages' own guidelines with regard to string concatenation. Nor did they demonstrate any clear understanding of how modern operating systems work. Sadly this was an all-round poor effort.
Re:Check their work or check the summary? by w3woody · 2015-03-25 04:29 · Score: 1

Really, what's happening is that they're performing repeated concatenations of various length strings--an operation that eventually becomes O(m*n) time, with m being the length of the string and n being the number of strings. (Concatenating strings in Java requires a new string to be created, then the contents of the two source strings copied into the new destination.) Appending a file, on the other hand, is only an O(n) operation, but has a very large constant time associated with it. So, in essence: TL;DR: O(n**2) operations can be slower in memory than O(n) operations on disk for large values of n.
The real lesson is that you should understand what's going on underneath the hood. And in this case, if you're doing a lot of string concatenation operations in Java, you probably should be using the StringBuilder class. I mean, after all, that's why there are multiple ways to do the same thing in Java (like ArrayList versus LinkedList): each offers different performance characteristics, and at the fringes performance characteristics can kill your application.
Re:Check their work or check the summary? by danlip · 2015-03-25 04:30 · Score: 5, Interesting

The language is not the problem, the code is terrible. They did String concatenation in the most expensive way possible. I'm pretty sure if you used a pre-sized StringBuilder it would be faster in memory.
They also make some very novice benchmarking mistakes.
This is actually a pretty good interview problem. Anyone who writes code like that should not be hired, even for a junior position.
Re:Check their work or check the summary? by halivar · 2015-03-25 04:31 · Score: 1

1. Eyup.
2. They actually did flush.
3. Absolutely.
Re:Check their work or check the summary? by P.+I.+Staker · 2015-03-25 04:34 · Score: 1

I actually think the paper is relatively readable, easy to understand, and complete in its explanation (not to mention pretty short). They explain exactly why they got the results they did, and what can be done to improve the in-memory version. This is an argument against expecting code to automatically be faster when executing in memory. Basically, they found that the overhead of performing string operations, using standard methods in high level programming languages, caused the in-memory performance to be poor. No part of this paper is trying to claim Seek/Read/Write time of disk approaches that of memory. The authors are telling you to take the full system and programming language considerations into account and not assume in-memory means faster, especially now that systems with tons of memory are becoming common.
Re:Check their work or check the summary? by Lunix+Nutcase · 2015-03-25 04:35 · Score: 1

Flush in Java doesn't actually force the OS to write to disk.

If the intended destination of this stream is an abstraction provided by the underlying operating system, for example a file, then flushing the stream guarantees only that bytes previously written to the stream are passed to the operating system for writing; it does not guarantee that they are actually written to a physical device such as a disk drive.
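If the goal really is to time data reaching the device, Java does expose a way to ask for that. A minimal sketch, assuming a plain FileOutputStream; the class and helper names are made up for illustration, and a drive's own write cache may still sit between the OS and the platters.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class SyncedWrite {
    // Writes data to path and asks the OS to push it to the storage device.
    static void writeAndSync(Path path, String data) throws IOException {
        try (FileOutputStream out = new FileOutputStream(path.toFile())) {
            out.write(data.getBytes(StandardCharsets.UTF_8));
            out.flush();         // hands bytes to the OS (a no-op for a bare FileOutputStream)
            out.getFD().sync();  // returns once the OS reports the data written to the device
        }
    }

    // Demo helper: writes to a temp file, syncs, and reads the contents back.
    static String roundTrip(String data) {
        try {
            Path tmp = Files.createTempFile("sync-demo", ".txt");
            try {
                writeAndSync(tmp, data);
                return new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("hello"));
    }
}
```

Calling sync() inside a benchmark loop is what makes the disk cost visible; flush() alone only measures the hand-off to the OS cache.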
Re:Check their work or check the summary? by gstoddart · 2015-03-25 04:38 · Score: 2

I think what they've proven is that there are so many layers in modern programming languages that most of what programmers do because it seems like a good idea probably generates terrible outcomes.
This actually explains a lot about modern programs, and how 5 years later a machine with twice the resources takes the same amount of time to do something as 5 year old software.
Because the bloat and inefficiencies added in those five years offset any other improvements. :-P

--
Lost at C:>. Found at C.
Re:Check their work or check the summary? by waterford0069 · 2015-03-25 04:39 · Score: 1

And the Python gets even worse when they prepend `addstring` instead of appending it; making the in memory inherently different from the to disk test.
concatString = addString + concatString # modified from: concatString = concatString + addString
Re:Check their work or check the summary? by halivar · 2015-03-25 04:43 · Score: 4, Insightful

And this is why we should not teach CS101 in Java or Python. If they'd been forced to use C this whole experiment would have turned out differently. Even the professors are getting lazy, now.
Re:Check their work or check the summary? by Jerry+Atrick · 2015-03-25 04:43 · Score: 2

The direct disc write should also manage to overlap writes to the stream object with flushes from it to the underlying drive. Except of course it doesn't because they aren't writing enough data for the disc write to actually start before they're done. I'm also a little confused about why they think flush+close is synchronous, it's going to return instantly and flush data in the background. So they aren't even timing what they think they are.
Back in the world of programmers with a clue, I did fix an in-memory piece of insanity like this not long ago. Making buffer expansion allocations more aggressive got a 10,000x speed improvement.
Dumb concatenation is for lazy or dumb programmers. Programmers that lazy probably could benefit from using more efficient append ops in the streaming libs, even if they don't understand why it works.
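To illustrate why a more aggressive expansion policy matters, here is a small sketch that only counts how many characters get copied while a buffer grows; it does not allocate anything. The class and method names are hypothetical, and the two policies (fixed step versus doubling) are assumptions, not the code from the anecdote above.

```java
class GrowthPolicy {
    // Chars copied while appending n chars one at a time to a buffer
    // that grows by a fixed step whenever it is full.
    static long copiesWithFixedStep(int n, int step) {
        long copied = 0;
        int capacity = step;
        for (int size = 0; size < n; size++) {
            if (size == capacity) {
                copied += size;      // existing contents move to the new array
                capacity += step;
            }
        }
        return copied;
    }

    // Same count, but the capacity doubles each time the buffer is full.
    static long copiesWithDoubling(int n) {
        long copied = 0;
        int capacity = 1;
        for (int size = 0; size < n; size++) {
            if (size == capacity) {
                copied += size;
                capacity *= 2;
            }
        }
        return copied;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        System.out.println("fixed step of 16: " + copiesWithFixedStep(n, 16));
        System.out.println("doubling:         " + copiesWithDoubling(n));
    }
}
```

With doubling the total copied stays below 2n, while a fixed step makes it grow with the square of n.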
Re:Check their work or check the summary? by wardeana · 2015-03-25 04:44 · Score: 1

Actually 'modern' (last few years) Java compilers will change that += behind the scenes to a StringBuilder / StringBuffer so that really shouldn't have any effect... agree that it used to be horrendously bad to do that though.
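One caveat is that the compiler rewrites each `+=` statement on its own, not the loop as a whole. A rough sketch of the equivalence, assuming the javac behaviour up to Java 8 (later versions emit an invokedynamic call instead, but still build a new String per statement); the class and method names are made up for illustration.

```java
class DesugaredConcat {
    // What the source looks like.
    static String withPlusEquals(String addString, int numIter) {
        String concatString = "";
        for (int i = 0; i < numIter; i++) {
            concatString += addString;
        }
        return concatString;
    }

    // Roughly what older javac versions generate for the loop above:
    // a fresh StringBuilder per iteration, which copies the whole
    // accumulated string in and then copies it out again in toString().
    static String desugared(String addString, int numIter) {
        String concatString = "";
        for (int i = 0; i < numIter; i++) {
            concatString = new StringBuilder()
                    .append(concatString)
                    .append(addString)
                    .toString();
        }
        return concatString;
    }

    public static void main(String[] args) {
        System.out.println(withPlusEquals("ab", 3).equals(desugared("ab", 3)));
    }
}
```

So the per-iteration copy of the accumulated string remains, and the loop stays quadratic unless a single StringBuilder is hoisted out of it by hand.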
Re:Check their work or check the summary? by halivar · 2015-03-25 04:44 · Score: 1

I stand corrected. And I assert that a professor of CS from Calgary or BC should know better than me, anyhow.
Re:Check their work or check the summary? by Anonymous Coward · 2015-03-25 04:48 · Score: 2, Interesting

Fixed their code by using a StringBuilder and moving the flush call inside the loop, so it actually writes it to disk.
The result:
In-memory mean: string time 0.008900000000000002
In-memory mean: file time 0.0034000000000000002
Disk-only mean: file time 1.1747
Yes, it's still quicker to do things in memory, you just have to do it right.
PS: with just one flush:
In-memory mean: string time 0.0091
In-memory mean: file time 0.0038000000000000004
Disk-only mean: file time 0.026599999999999995
Still faster in memory.
Re:Check their work or check the summary? by PRMan · 2015-03-25 04:48 · Score: 1

Plus is faster if you are not in a loop.

--
Peter predicted that you would "deliberately forget" creation 2000 years ago...
Re:Check their work or check the summary? by chilenexus · 2015-03-25 04:50 · Score: 1

Seek time alone is always slower than memory, even before you add in the latency and read/write times. It's disturbing to me that this article calls referring to memory being faster an "assumption". In college they had us do the math on paper to figure out the average latency and read times for a given RPM, and how they come up with the average seek times. The only assumption being made is that the manufacturer is honestly reporting figures accurate to a single order of magnitude.
Re:Check their work or check the summary? by rjstanford · 2015-03-25 04:51 · Score: 1

Have you ever read Programming Pearls? Full of great examples of how to do elegant work with no resources that apply just as much today (just prepend "giga" to all of the numbers they use). From your comment I think you'd enjoy it.

--
You're special forces then? That's great! I just love your olympics!
Re:Check their work or check the summary? by devent · 2015-03-25 04:53 · Score: 1

All they did compare was one memory access vs. another memory access, and showed that Strings are inefficient compared to a byte array. Because the first code will concatenate Strings in memory, the second code will concatenate byte array data in memory, and then both are written to disk. The disk-access test should have been:
for (int i=0; i < numIter; i++) {
writer.write(addString);
writer.flush();
}
writer.close();

--
http://www.mueller-public.de - My site http://www.anr-institute.com/ - Advanced Natural Research Institute
Re:Check their work or check the summary? by bwcbwc · 2015-03-25 04:54 · Score: 2

And there goes another grad student's research thesis up in smoke. CS departments need to have more courses that distinguish between abstract theory (raw algorithms) and software engineering (practical effects of choosing specific languages and features). It's clear the authors of this are in an ivory tower where every string type is the same type of construct in every language.

--
We are the 198 proof..
Re:Check their work or check the summary? by Lunix+Nutcase · 2015-03-25 04:56 · Score: 2

Except any half-decent Java developer uses StringBuilder not + concat because everyone knows the latter is slower and causes more objects to be created. The only thing they proved is by purposefully doing something wrong you can make it crappy.
Re:Check their work or check the summary? by Anonymous Coward · 2015-03-25 04:57 · Score: 1

I don't agree with that. What I would say is for a junior position, the candidate should be able to see why it's flawed.
Re:Check their work or check the summary? by Anonymous Coward · 2015-03-25 04:58 · Score: 3, Informative

Changed Java code to use StringBuilder instead of String += String. Results on my machine:
1: 0.010625
10: 0.002375
100: 0.001
Maybe somebody who studies Chemical and Biological Science is not a good developer.
Re:Check their work or check the summary? by Lunix+Nutcase · 2015-03-25 04:58 · Score: 1

But they didn't do direct disk write. They merely flushed to disk cache handled by the OS.
Re:Check their work or check the summary? by CanadianMacFan · 2015-03-25 05:00 · Score: 1

In my first year of engineering I had a class in assembly but on a simplified processor simulator. I learned quite a lot on it because you saw how everything got modified by your commands. The lectures were terrible because the prof told the same jokes from the book that he wrote. I still have the software somewhere but it's on a 5.25" floppy.
Actually it's on my project list to create a similar program for the iPad if I ever get the time or the help to do it.
Re:Check their work or check the summary? by Lunix+Nutcase · 2015-03-25 05:01 · Score: 1

Except everyone already knows that using that method of string concatenation is slow. This paper was just pages of waste for what is a one sentence guideline: "Use StringBuilder to more efficiently do string concatenation." But any competent Java programmer already knows that.
Re:Check their work or check the summary? by VGPowerlord · 2015-03-25 05:06 · Score: 1

Java performance numbers did not change when the concatenation order was reversed in the code in Appendix 1. However, using a mutable data type such as StringBuilder or StringBuffer dramatically improved the results.
What's worse is that there are warnings all over the 'net to not use string concatenation in a loop in Java. So, despite these warnings, they did that anyway and tout incorrect assumptions based on their faulty testing.
That's without even considering the other flaw you pointed out (not flushing after each write).

--
GLaDOS for President 2016! "Well here we are again. It's always such a pleasure." -- GLaDOS, 2011
Re:Check their work or check the summary? by gstoddart · 2015-03-25 05:07 · Score: 1

No, but suddenly I'm intrigued.
Sometimes it's tough to explain to the kids these days why they take too damned much for granted with their languages.
I knew a guy with a Masters in CS who loudly proclaimed optimizing was a pointless exercise.
He wrote some of the shittiest, slowest, and un-maintainable code I've ever seen because he was confusing "clever" with "smart" and often "clevered" himself into corners he couldn't get back out of.
Often he couldn't make minor changes to his own code because it was so "elegant" as to be brittle and impossible to change.
Very often the attitude of "the library does everything, let it deal with it" means you have no idea of how bad the code you're writing actually is.

--
Lost at C:>. Found at C.
Re:Check their work or check the summary? by Coryoth · 2015-03-25 05:12 · Score: 2

And this is why we should not teach CS101 in Java or Python. If they'd been forced to use C this whole experiment would have turned out differently.
Not at all. If you wrote your C in memory string handling as stupidly as they wrote the Python and Java you will still get worse performance in C (e.g. each iteration malloc a new string and then strcpy and strcat into it, and free the old string; compared to buffered file writes you'll lose). It's about failing to understand how to write efficient code, not about which language you chose.

--
Craft Beer Programming T-shirts
Re:Check their work or check the summary? by gstoddart · 2015-03-25 05:12 · Score: 1

You know, I would argue that saying "everyone knows" is overly optimistic, bordering on naive.
Because I've seen many programmers who simply don't know, and just assume they're all equal and magic.
This was true in C 25 years ago, Java 15 years ago, and probably every other language now.
Do not underestimate the capacity of humans to be clueless and assume they know what they are doing.
My guess, audit a sufficiently large amount of code, and you'll quickly realize people simply do NOT actually know what you think everyone does.
I'm betting there's a lot of crappy code in the world which neither knows nor cares what actually happens.
My personal experience tells me there are more mediocre coders than actual good ones.

--
Lost at C:>. Found at C.
Re:Check their work or check the summary? by Daniel+Hoffmann · 2015-03-25 05:16 · Score: 2

Looks like someone forgot to use StringBuilder
Re:Check their work or check the summary? by djbckr · 2015-03-25 05:18 · Score: 1

This was written by some guys that really don't know how computers work. Seriously. They have not studied algorithms nor understand how Java/Python works under the hood, nor how the operating system I/O subsystem works (specifically caching). How this wound up on /. I just don't get it. If you are doing string concatenation, at least try to do it the right way.
Re:Check their work or check the summary? by danlip · 2015-03-25 05:22 · Score: 1

But they are in a loop, so what's your point?
And to be more pedantic, I don't think "+" is ever faster in Java, it gets expanded to StringBuilder at compile time. "+" is more readable and not slower if you can do it all in one statement. If spread out across control statements (if/else or loops) then use StringBuilder.
Re:Check their work or check the summary? by MerlynEmrys67 · 2015-03-25 05:29 · Score: 1

Here is how you make the results faster.
Compare:
1) Calculating 1MB of data and writing it to disk
vs.
2) Calculating 100K of data, writing it to disk, repeat 10 times.
If the time to write to disk is what takes most of the time, then getting the operation started early and writing to disk in parallel to calculating your data will always win. In their case - what they have done is made the in memory operation exceedingly stupid so it takes too much time. I could trivially write C code that blows their operation out of the water, unfortunately - their Java/Python code is hiding a LOT of inefficiencies in the in-memory operation. So compare their disk operation, with a cache line optimized in-memory calculation and a disk cache optimized disk write operation... It won't even be close.

--
I have mod points and I am not afraid to use them
Re:Check their work or check the summary? by Lunix+Nutcase · 2015-03-25 05:33 · Score: 1

You know, I would argue that saying "everyone knows" is overly optimistic, bordering on naive.
As would I which is why I never used that phrase. Your quote is not my words.
Re:Check their work or check the summary? by gstoddart · 2015-03-25 05:37 · Score: 1

As would I which is why I never used that phrase. Your quote is not my words.
"Except any half-decent Java developer uses Stringbuilder not + concat because everyone knows the latter is slower and causes more to be objects created"
Oh? Really? So you did not use the words "everyone knows"?
Exactly whose words were they inside of your post?

--
Lost at C:>. Found at C.
Re:Check their work or check the summary? by sanosuke001 · 2015-03-25 05:41 · Score: 1

This is exactly why Java has a StringBuilder class.

final long startTimeBad = System.nanoTime()
String s = ""
for(int i = 0 i s += (char)Math.floor(Math.random() * 256) }
final long endTimeBad = System.nanoTime()
System.out.println(s)
System.out.println("bad: " + (endTimeBad - startTimeBad) + " ns")
final long startTime = System.nanoTime()
final StringBuilder sb = new StringBuilder()
for(int i = 0 i sb.append((char)Math.floor(Math.random() * 256)) }
final long endTime = System.nanoTime()
System.out.println(sb.toString())
System.out.println("good: " + (endTime - startTime) + " ns")
System.out.println("diff: " + (endTimeBad - startTimeBad) / (double)(endTime - startTime))

bad: 399644443512 ns
good: 76023788 ns
diff: 5256.834130811792

Note: semicolons removed because yay slashdot comments

--
-SaNo
Re:Check their work or check the summary? by sanosuke001 · 2015-03-25 05:43 · Score: 1

Again, because of slashdot comments, I didn't realize that the internal loop code was "lost"
bad loop: s += (char)Math.floor(Math.random() * 256)
good loop: sb.append((char)Math.floor(Math.random() * 256))

--
-SaNo
Re:Check their work or check the summary? by sanosuke001 · 2015-03-25 05:45 · Score: 1

gd. slashdot can... ugh. both loops go to 1000000
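Piecing this comment together with the two before it, the benchmark can be sketched as runnable Java roughly as follows. The loop bodies come from the comments themselves; the class and method names are made up here, and main uses a smaller iteration count than the stated 1000000 so the slow variant finishes quickly. Treat it as a reconstruction under those assumptions, not the poster's exact code.

```java
class ConcatBench {
    // "Bad" loop: every += builds a brand-new String.
    static String bad(int n) {
        String s = "";
        for (int i = 0; i < n; i++) {
            s += (char) Math.floor(Math.random() * 256);
        }
        return s;
    }

    // "Good" loop: one StringBuilder is reused for all appends.
    static String good(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append((char) Math.floor(Math.random() * 256));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int n = 100_000; // assumption: reduced from 1000000 to keep the run short

        long startBad = System.nanoTime();
        bad(n);
        long badNs = System.nanoTime() - startBad;

        long startGood = System.nanoTime();
        good(n);
        long goodNs = System.nanoTime() - startGood;

        System.out.println("bad: " + badNs + " ns");
        System.out.println("good: " + goodNs + " ns");
        System.out.println("diff: " + badNs / (double) goodNs);
    }
}
```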

--
-SaNo
Re:Check their work or check the summary? by Bengie · 2015-03-25 05:48 · Score: 1

Yeah, they flushed after the fact

writer = new BufferedWriter( new FileWriter("test.txt"));
startTime = System.currentTimeMillis();
for (int i=0; i < numIter; i++) {
writer.write(addString);
}
writer.flush();
writer.close();
Re:Check their work or check the summary? by Obfuscant · 2015-03-25 05:49 · Score: 1

In my first year of engineering I had a class in assembly but on a simplified processor simulator.
The first assembly class when I took CS was on a Cyber 6500. The assembly language was absolutely trivial to learn. The hardest part was keeping track of which ops were 15, 30, or 60 bits long so you could pack them into memory efficiently.
The next was PDP 8. Even simpler. But magical.
Re:Check their work or check the summary? by Rakarra · 2015-03-25 05:51 · Score: 3, Insightful

Not at all. If you wrote your C in memory string handling as stupidly as they wrote the Python and Java you will still get worse performance in C (e.g. each iteration malloc a new string and then strcpy and strcat into it, and free the old string; compared to buffered file writes you'll lose). It's about failing to understand how to write efficient code, not about which language you chose.
Yes, but we're talking new programmers here. At least in C, you're forced to have to explicitly write inefficient code. New programmers know what malloc does (if they don't, they're behind in their classes). In Java and Python, things are done for you. That can be good! It frees you from a bit of micromanagement. But again, for a new programmer, it's not apparent that they're doing something especially inefficient because the work happens invisibly. It's obvious when you have to malloc() a whole new string buffer in C every time you append to a string. It's less obvious in Java when you just append and the runtime ends up creating a new buffer on the heap for you. ASM is perhaps a bit TOO low-level and weird to start a new programmer on, but I think a full OOP language like Java or scripting language like Python might be too high-level and encourage bad habits to develop. In my CS classes, C hit a pretty good sweet spot.
Then again, you can program badly in any language, and C has its own perils.
Re:Check their work or check the summary? by sjames · 2015-03-25 05:55 · Score: 1

You've mis-understood the purpose of the paper. Read it with an open mind and you will note that they deliberately selected the most expensive string handling. It strongly suggests that they were well aware that they were stacking the deck, but doing so in a way that a programmer might carelessly fall into the trap. The intent was to educate rather than break new ground.
Re:Check their work or check the summary? by Ksevio · 2015-03-25 05:59 · Score: 1

The experiment is a little strange. In both cases, they're trying to write a string to a file. In one case they're doing terrible concats as you guessed, then writing to the file. In the other case, they're just writing to the file. Shockingly performing the extra step takes extra time.

They even used a class called "BufferedWriter" to write to the file stream in Java - what did they think that might be doing?

Don't worry though, they did throw a flush() in there right before the close()
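What BufferedWriter does in between can be seen with a small sketch that swaps the file for an in-memory StringWriter, so the contents of the underlying sink can be inspected. The class and method names are hypothetical; the default buffer size of 8192 chars is what current JDKs use, which is why a short run of writes never reaches the sink before the flush.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;

class BufferedWriterDemo {
    // Returns {chars visible in the sink before flush, chars visible after flush}.
    static int[] visibleBeforeAndAfterFlush(String addString, int numIter) {
        StringWriter sink = new StringWriter(); // stands in for the file
        try (BufferedWriter writer = new BufferedWriter(sink)) {
            for (int i = 0; i < numIter; i++) {
                writer.write(addString);        // lands in the writer's internal char buffer
            }
            int before = sink.getBuffer().length();
            writer.flush();                     // only now is the buffer pushed to the sink
            int after = sink.getBuffer().length();
            return new int[] { before, after };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] counts = visibleBeforeAndAfterFlush("1", 100);
        System.out.println("before flush: " + counts[0] + ", after flush: " + counts[1]);
    }
}
```

In other words, the "disk-only" loop is itself appending to an in-memory buffer, just a mutable one.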
Re:Check their work or check the summary? by Immerman · 2015-03-25 06:11 · Score: 1

Sure, if you expose enough of the implementation details it becomes obvious when you're doing something stupid
for (int i=0; i < numIter; i++) {
char * newString = malloc ( strlen(concatString) + strlen(addstring) + 1);
memcpy (...);
memcpy(...) ;
free(concatString);
concatString = newString;
}
Suggests strongly a number of different potential optimization routes, though most would require either multiple traversals of the source material, or a growable buffer which would greatly increase the code complexity (or does C offer such a thing these days? I've been on ++ for a long while now)
You've got to admit though, that's a lot uglier to work with for most incidental string-manipulation where performance is largely irrelevant.

--
--- Most topics have many sides worth arguing, allow me to take one opposite you.
Re:Check their work or check the summary? by bondsbw · 2015-03-25 06:12 · Score: 1

I assumed each entire string could be read or written in one operation, that's why I used O(n). But that certainly may have been a faulty assumption, considering memory/disk paging and such things that I rarely deal with at my day job.

--
All my liberal friends think I'm a conservative, all my conservative friends think I'm a liberal.
Re:Check their work or check the summary? by sribe · 2015-03-25 06:14 · Score: 1

Not at all. If you wrote your C in memory string handling as stupidly as they wrote the Python and Java you will still get worse performance in C (e.g. each iteration malloc a new string and then strcpy and strcat into it, and free the old string; compared to buffered file writes you'll lose). It's about failing to understand how to write efficient code, not about which language you chose.
It's not actually that stupid, FYI. It's a tradeoff, and the advantage of doing it this way is greatly reduced locking when you have multiple threads and the possibility of strings being shared between threads.
Of course, the paper to which this article refers is still garbage ;-)
Re:Check their work or check the summary? by gnasher719 · 2015-03-25 06:20 · Score: 1

You don't even have to read the code. Reading what languages they used reveals the entire flaw. They used languages with expensive string operations when done in-memory which is the only reason why writing to a buffered cache and writing to disk is faster.

No, string operations in Java are not expensive. They are expensive if you do them stupidly.
Re:Check their work or check the summary? by gnasher719 · 2015-03-25 06:26 · Score: 1

That is pretty much what the article suggests. Concatenating string involves creating objects, blah blah blah...

I doubt you'd see the same "9000 times slower!" kind of results with standard C strings.
Of course you do. Assuming you allocated a big enough buffer at the start, strcat takes time proportional to the length of the first string, because it has to search for the zero byte at the end of that string. If you want it fast, use C++ std::string, or Objective-C NSMutableString.
Re:Check their work or check the summary? by gnasher719 · 2015-03-25 06:30 · Score: 1

I knew a guy with a Masters in CS who loudly proclaimed optimizing was a pointless exercise.
In many cases, it is true. Not being able to optimise however is quite bad. On the other hand, in my experience when I was given code that ran too slow, it almost never was because it wasn't optimised, but because it did something stupid (like some code that downloaded n files and took O (n^3) time; worked fine with n = 10 but when I tried with n = 200 it just broke down). Changing that to O (n) isn't what I would call "optimising".
Re:Check their work or check the summary? by hibiki_r · 2015-03-25 06:47 · Score: 1

It might have helped in this problem, but nowadays, even assembly language is just an abstraction: You might think you are doing in-order operations on 8086, but they are really being translated to out of order operations inside of the processor that will get the same result, but with very different performance. Branch prediction? Nah, we can run the beginning of BOTH branches, and just discard the computation we did not want, because it's actually faster. And don't get me started on the differences between what you tell a video card to do, and what it actually does.
The distance between what we write in practical, end user facing applications and what happens in the hardware is so large nowadays that it's hard to have any real control over what is going on. The best we can do is understand the performance characteristics that we see in the layer right below ours, and hope things don't change too much.
Therefore, the problem with the original paper is that it fails to really explain what we can learn from the experiment. It's not that disk is faster than RAM: That's just ludicrous. But that we really have to have some understanding of the libraries and VMs we use to get anywhere. It'd not be impossible for the JVM to realize the immutable string is being edited in a loop, that there are no references to it that could escape, and then just optimize the whole thing into a string buffer implementation that should be as good as calling the file writer: It just happens to not do said optimization for us. It's happened in Java before: Code that was seen as terrible because it was very slow is not slow anymore, and easier to read than the old school way of optimizing it.
Re:Check their work or check the summary? by waterford0069 · 2015-03-25 06:49 · Score: 1

If they wanted to do that, all they had to do was post this. http://www.joelonsoftware.com/...

I think they missed that the += operation is an O(n) operation, and that putting it in a loop made it an O(n^2) operation - as compared to the O(n) operation of writing to the file buffer. If they hadn't missed it, then they would have come right out and said that when they were talking about using StringBuilder or StringBuffer.
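The arithmetic behind that can be checked with a small sketch that counts characters copied instead of timing anything. It assumes every `+=` copies the whole new string once and ignores allocation cost; the class and method names are made up for illustration.

```java
class CopyCount {
    // Chars copied when the string is rebuilt from scratch on each of n appends.
    static long copiedByRepeatedConcat(long n, long addLen) {
        long copied = 0;
        long length = 0;
        for (long i = 0; i < n; i++) {
            length += addLen;
            copied += length; // the new string holds the old contents plus the addition
        }
        return copied;
    }

    // Chars copied when each addition goes straight into a large enough buffer.
    static long copiedByAppendToBuffer(long n, long addLen) {
        return n * addLen;
    }

    public static void main(String[] args) {
        long n = 1_000_000;
        System.out.println("repeated concat: " + copiedByRepeatedConcat(n, 1));
        System.out.println("buffer append:   " + copiedByAppendToBuffer(n, 1));
    }
}
```

For a million one-byte appends the first count is n(n+1)/2, about 5 * 10^11 characters, against 10^6 for the buffered version.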
Re:Check their work or check the summary? by rnturn · 2015-03-25 07:10 · Score: 1

Your friend with the privileged account (I'm assuming this was running VMS, no?) might have been able to get away with using as much memory as he could. (Let's just hope he wasn't using an account with BYPASS privs enabled by default. I've encountered too many people who abused VMS boxes by setting their account up that way. They made terrible messes. Like code that can't run when accounts with less than "god" privileges are used. The problems they created were a pain -- and, sometimes, impossible -- to clean up.)
I had an intern discover that it was indeed faster to not do everything in memory. He was reading everything from a file into an array, applying some scale factor to all the elements, and then writing the entire array out to disk. It was taking forever. I had him try reading in one value at a time, apply the scale factor, and immediately write it out. Ran in a fraction of the time. Why? His account was on a system with a lot of other users. Memory quotas were enforced to avoid a single user taking over the entire system. When he read everything into memory, VMS was paging like crazy to fit his process into the smallish amount of RAM his quotas allowed and his program was spending a huge amount of time waiting for I/O to complete.
I'm guessing that I'd see the same thing happening on my Linux systems if user account ulimits weren't, by default, all set up as "unlimited".

--
CUR ALLOC 20195.....5804M
Re:Check their work or check the summary? by ripvlan · 2015-03-25 07:19 · Score: 1

Yes thank you. This is a classic mistake that many first time programmers make. Years ago somebody was comparing the speed of C++ over VBScript/IIS over Java for writing Web pages back to a browser. They too came to a similar conclusion - and also made the same mistake.
When writing to disk - the data is written once. The algorithm for in memory is not doing the same thing. It is allocating a new buffer - copying all data to said new buffer - and finally adding data to the end. If one compared the I/O of the two program executions the "in-memory" version would have many times MORE I/O.
Doing this:
Loop N.{ x = x + newValue}
Write(x)
Will always be slower than
Loop N.{ Write(newValue) }
In C# and I think Java - there is a class called StringBuilder (Java also has the older StringBuffer) - and it is intended for this kind of workload. The first thing I learned in data structures class was how to expand buffers using different algorithms and pro/con of each (heap design for instance, buddy system).
Plus - O(n) is not created equal. :-P
Re:Check their work or check the summary? by sjames · 2015-03-25 07:29 · Score: 1

They were QUITE clear in the paper that they deliberately chose the least optimal string handling. They stated it EXPLICITLY. It's not their fault you skimmed with intent to belittle.
Re:Check their work or check the summary? by BarbaraHudson · 2015-03-25 07:46 · Score: 1

Now think about the future when "everybody kan kode!" Your 16-core 4ghz cpus will choke.

--
"Transparent" is a shit show that trades on every stereotype going. A man in drag is NOT a transsexual.
Re:Check their work or check the summary? by OverlordQ · 2015-03-25 07:50 · Score: 4, Informative

THATS THE ENTIRE POINT OF THIS PAPER.

It is easy to explain the results: In high-level languages such as Java and Python, a seemingly benign
statement such as concatString += addString may actually involve executing many extra cycles behind
the scenes. To concatenate two strings in a language such as C, if there is not enough space to expand
the concatString to the size it needs to be to hold the additional bytes from addString, then the
developer has to explicitly allocate new space with enough storage for the sum of the sizes of the two
strings and copy concatString to the new location, and then finally perform the concatenation. In Java
and Python strings are immutable, and any assignment will result in the creation of a new object and
possibly copy operations, hence the overhead of the string operations. The disk-only code, although
apparently writing to the disk excessively, is only triggering an actual write when operating system
buffers are full. In other words, the operating system already lessens disk access times. A developer
familiar with the language and system internals readily notices the causes of this observed behaviour,
but this behaviour may be easily missed, as indicated by examining similar cases in production code.

--
Your hair look like poop, Bob! - Wanker.
Re:Check their work or check the summary? by OverlordQ · 2015-03-25 07:51 · Score: 1

That's their entire point.

--
Your hair look like poop, Bob! - Wanker.
Re:Check their work or check the summary? by ewibble · 2015-03-25 08:02 · Score: 1

Basically, they didn't test the same thing.
They tested concatenating 1 byte, 10 byte and 1000 byte strings in memory and disk.
In Java, each time you append to a string you allocate a new piece of memory and copy the old string into it. When you append on disk, it will allocate the data in blocks. The proper way of doing this type of operation is to use a StringBuffer or StringBuilder.
The string concatenation method is order n^2, whereas with pre-allocation and just appending it is order n. The whole thing about order is that k*n^2 will always get slower than j*n for a big enough value of n, no matter how much bigger j is than k.
So what they are saying is don't do really stupid stuff in memory or it could be slower than doing not so stupid stuff on disk. No shit.
Oh yes then they wrote the string out to disk in the memory test as well, effectively doing the same operation as disk only method. Just to be on the safe side so your operation in memory must be slower.
Either these "researchers" are stupid or just want to make the headlines.
Re:Check their work or check the summary? by s.petry · 2015-03-25 08:10 · Score: 1

Also, they are from a biology department, what do you expect?
Self sterilization and removal from the gene pool?
you asked....
probably a snark tag needed...

--
-The wise argue that there are few absolutes, the fool argues that there are no probabilities.
Re:Check their work or check the summary? by s.petry · 2015-03-25 08:12 · Score: 1

No, it was the subsequent post that skipped QA.

--
-The wise argue that there are few absolutes, the fool argues that there are no probabilities.
Re:Check their work or check the summary? by waterford0069 · 2015-03-25 08:14 · Score: 1

Oh, I read the whole paper through. It was this line that got me.

a seemingly benign statement such as concatString += addString may actually involve executing many extra cycles behind the scenes.

I don't think they understood that += was taking 1,000,000 cycles to execute when they got to the 1,000,000th concatenation. I think they imply by omission that the copy operation is O(1), which clearly it is not.

They do correctly identify the problem being in the data copy

The problem is caused by data copy operations in the main memory, whether in the heap or stack space. We also see that immutable strings are not inherently a problem, as evidenced by Python’s much better performance with the modified code.

But they don't identify that it's because their copy is taking longer and longer each time they do the append. Instead they leave it up to the "magic" of the append. A true (amortized) O(1) algorithm for appending text (StringBuilder) would not have had the same problem; it's still doing memory copies, just FAR fewer.

It's not that there's one piranha in the river that's the problem; it's that there are 1,000,000 piranha in the river. Likewise I'd rather face one piranha than one grizzly bear, and 10 piranha than 10 grizzly bears; but 1 grizzly bear vs 500,000 piranha
Re:Check their work or check the summary? by jfbilodeau · 2015-03-25 08:33 · Score: 1

OMG C is SOOOooOO slow compared to my l33t Java code! In the spirit of the fine paper related to this article, here's a 100% fair, unbiased comparison of both languages.

[code lang="c"]
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

int main(int argc, char** argv) {
    const char* appendString = "1";
    char* concatString = (char*)malloc(1);
    concatString[0] = 0;
    int numIter = 100000;
    clock_t startTime, endTime;

    startTime = clock();
    for (int i = 0; i < numIter; i++) {
        char* tempString = (char*)malloc(strlen(concatString) + strlen(appendString) + 1);
        strcpy(tempString, concatString);
        strcat(tempString, appendString);
        free(concatString);
        concatString = tempString;
    }
    endTime = clock();

    double totalTime = (double)(endTime - startTime) / CLOCKS_PER_SEC;
    printf("Operation took %f seconds", totalTime);
    return EXIT_SUCCESS;
}
[/code]

[code lang="java"]
public class Crap {
    public static void main(String[] args) {
        String appendString = "1";
        String concatString = "";
        int numIter = 100000;
        long startTime, endTime;

        startTime = System.currentTimeMillis();
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < numIter; i++) {
            builder.append(appendString);
        }
        concatString = builder.toString();
        endTime = System.currentTimeMillis();

        double totalTime = (double)(endTime - startTime) / 1000;
        System.out.printf("Operation took %f seconds\n", totalTime);
    }
}
[/code]

[code lang="term"]
$ cc -O3 crap.c -o crap
$ ./crap
Operation took 0.749926 seconds
$ javac Crap.java
$ java Crap
Operation took 0,009000 seconds
[/code]

Look how FATSTER JAVA is compared to C!!!!1!! You should be ashamed of using such a sloow programming language like C!

--
Goodbye Slashdot. You've changed.
Re:Check their work or check the summary? by ewibble · 2015-03-25 08:37 · Score: 1

The revelation that if you do something really stupid, it's going to be slower than if you don't, isn't exactly a new revelation, is it? They could have stuck solving the traveling salesman problem in memory before writing to disk too, but that would not have yielded any new revelations either.
Big O notation has been around in Computer Science since 1976, (http://en.wikipedia.org/wiki/Big_O_notation#cite_note-knuth-11) this is hardly new research.
Re:Check their work or check the summary? by Kleanthes · 2015-03-25 08:42 · Score: 1

Would be even faster if you initialized the StringBuilder with numIter as its capacity. Hard to measure, though; I had to go up to 10 million 1-byte strings to actually see a difference ;-)
Re:Check their work or check the summary? by UnknownSoldier · 2015-03-25 08:56 · Score: 1

> Optimizing memory is a dying skill,
It is now called Data Orientated Design.
Google+ Group
* https://plus.google.com/+Datao...
Data-Oriented Design and C++
* https://www.youtube.com/watch?...
Typical C++ Bullshit
* http://macton.smugmug.com/gall...
Pitfalls of Object Oriented Programming
* http://research.scee.net/files...
* http://www.slideshare.net/royc...
Re:Check their work or check the summary? by sjames · 2015-03-25 10:31 · Score: 1

REALLY?!? You read that and STILL don't grasp that they were aware of how non-optimal it is? REALLY?!?
Re:Check their work or check the summary? by Nemyst · 2015-03-25 11:06 · Score: 1

No, what happens is that people will pre-allocate a super large memory buffer (and if their buffer is too small? whoops), or they'll completely forget to do free(), or they'll just copy code from the internet that does it for you.

Incompetence crosses language boundaries, if you think C would make them learn any faster, you're utterly kidding yourself.
Re:Check their work or check the summary? by pjt33 · 2015-03-25 12:17 · Score: 1

Some research into what proportion of programmers do know this kind of basic stuff would have made a much better paper.
Re:Check their work or check the summary? by gnupun · 2015-03-25 12:50 · Score: 1

That flush() call is unnecessary. The writer object calls flush() as part of close()ing the stream. The writer object also flushes its buffer periodically to disk as it fills up due to the write() calls.
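
The comment above is about the Java writer, but the same idea can be sketched in Python, on the assumption that Python's buffered file objects behave analogously: closing the file flushes whatever is still buffered. The function name and the file path are placeholders.

```python
def write_without_flush(path, pieces):
    """Write pieces with no explicit flush(), then read the file back."""
    with open(path, "w") as f:      # buffered text file
        for piece in pieces:
            f.write(piece)          # lands in a userspace buffer first
    # Leaving the with-block closes f, and close() flushes the buffer.
    with open(path) as f:
        return f.read()
```

If the data read back matches what was written, the explicit flush() added nothing beyond what close() already does.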
Re:Check their work or check the summary? by Coryoth · 2015-03-25 14:17 · Score: 1

It's stupid if you're benchmarking relative efficiency -- it's not an efficient implementation (and you'll have no trouble finding explanations for why the Python and Java code they wrote, while simpler, is not efficient).

--
Craft Beer Programming T-shirts
Re:Check their work or check the summary? by waterford0069 · 2015-03-25 15:41 · Score: 1

(Q) Did they identify the code that was the cause of the problem?

(A) Yes concatString += addString

(Q) Did they identify WHY that code was the cause of the problem?

(A) No, they hand waved about += having to do a few more operations than StringBuilder (vs. a metric-butt load that it's doing for a million character string)

(Q) WHY was that code the cause of the problem?

(A) In Memory was an O(n^2) algorithm, vs. an O(n) disk algorithm
And I don't believe they understood this, or they would have explicitly explained it in their paper and/or never bothered to publish it
(Q) Is this a problem for the paper?

(A) Yes. The paper title implies that the same algorithm taking place entirely in memory (and one single large disk write) could be slower than one with lots of disk writes. They are clearly not the same algorithm when you look under the hood of string concatenation and writing to a filebuffer.

It's kind of like saying, we had a tortoise and a hare race between point A and point B, pointing out that the tortoise won the race; but neglecting to mention the hare was facing the wrong way, and ran the long way around the world. Oh, but if we switch the hare for a rabbit (which happens to be facing the right way); then the rabbit beats the tortoise - clearly the hare takes longer per stride for some reason

(Q) Are there any other problems with the paper?

(A) Yes, their lots-of-disk-writes version of the algorithm is writing to a buffered stream, and while it is flushing, there is no guarantee that data has been physically written to the disk before the next iteration starts. And since the PC has way more than a million bytes free on it, there's a good chance that the OS didn't have to do a physical write until long after the program finished (the disk array controller may even have a backup battery, so it could be minutes before it actually gets written to disk).
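
For what it's worth, the difference between "flushed" and "asked to go to the device" looks roughly like this in Python. This is a sketch only; `durable_write` is a made-up name, and even `os.fsync` can be absorbed by a controller or drive cache, as noted above.

```python
import os


def durable_write(path, data):
    """Write data, flush the userspace buffer, then ask the OS to sync it."""
    with open(path, "w") as f:
        f.write(data)
        f.flush()               # userspace buffer -> operating system
        os.fsync(f.fileno())    # ask the OS to push its cached pages to the device
    return os.path.getsize(path)
```

A benchmark that stops at flush() is timing the first step only, which is a memory-to-memory copy.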

Identifying problems like this with a paper is not belittling the authors. Mocking them for publishing outside their area of expertise (re. Biology), or for being a potential expert (PhD. in Electrical Engineering(*)) and making what is clearly a junior coding mistake, might be. But I'm not mocking them. I'm identifying a fundamental problem with their paper. They're grownups in academia - they should expect challenges.

* - Lord knows that he may not have any actual training in software development beyond what he needed to get through school, and may have spent more time working on the hardware aspects of computing - that was the case with all my profs that came from an EE background. PhD. and P.Eng. does not mean infallible programming expert. It means highly specialized in one area of study.

This paper has done nothing to increase the knowledge of the world, and wasted lots of people's time. It's like they published a paper saying the world is round when analyzed at a sufficiently large scale. Be careful that you don't fly in a straight line from Washington to Japan according to a Mercator projection map because it'll take longer and you'll burn a lot more fuel. Every good pilot in the world knows that, and good developers should know the problems of string concatenation (especially in loops).
Re:Check their work or check the summary? by sjames · 2015-03-25 16:30 · Score: 1

You must be a glutton for punishment but I haven't the time or inclination to indulge you so:
*PLONK*
Re:Check their work or check the summary? by goose-incarnated · 2015-03-25 23:10 · Score: 1

And this is why we should not teach CS101 in Java or Python. If they'd been forced to use C this whole experiment would have turned out differently.
Not at all. If you wrote your C in memory string handling as stupidly as they wrote the Python and Java you will still get worse performance in C (e.g. each iteration malloc a new string and then strcpy and strcat into it, and free the old string; compared to buffered file writes you'll lose). It's about failing to understand how to write efficient code, not about which language you chose.
Yeah, but at least then they'd have to actually *write the inefficient code out*, thereby learning why it is inefficient. With Java and Python the novice does not know about the inefficiency because it is hidden behind a "+" operator. This is why OP said to teach in C - you have to implement the concatenation yourself and in the process you learn how not to do it.

--
I'm a minority race. Save your vitriol for white people.
Re:Check their work or check the summary? by goose-incarnated · 2015-03-25 23:25 · Score: 1

I knew a guy with a Masters in CS who loudly proclaimed optimizing was a pointless exercise.
These days it might just be for most use-cases. For example the "research" above shows this - the time consumed in 1 million inefficient string concatenations is what... less than 5 minutes? If you only perform a few hundred string concats at a time the program's user won't even notice the delay. If, like most use cases, you only concat a few strings at a time (say, a few tens) the user *certainly* won't notice. Not that I agree with such inefficiencies[1], but I *do* see the "why optimise" PoV.
There are only two rules for optimisation:
1. Don't optimise.
2. (For experts only) Don't optimise yet!
[1]One of my tasks in my first year of employment was to take a TCP stack and port to a different micro. My second task was given when the code was going through tests - I had to speed it up by a factor of two. I understand optimisation, and the important thing that I understand is that I do not have to do it much anymore!

--
I'm a minority race. Save your vitriol for white people.
Re:Check their work or check the summary? by LordLimecat · 2015-03-26 00:29 · Score: 1

String += String
I'm in a 200-level Java class. We're just learning inheritance. I could have told you why that's a bad way to do things.
Do people not study what arrays are and why it's expensive to continually append to them anymore?
Maybe these folks need to go back to basics.
Re:Check their work or check the summary? by sribe · 2015-03-26 02:11 · Score: 1

It's stupid if you're benchmarking relative efficiency -- it's not an efficient implementation (and you'll have no trouble finding explanations for why the Python and Java code they wrote, while simpler, is not efficient).
I think we're talking at cross-purposes. When I said "not actually that stupid" I was referring to the implementation of String as immutable and highly efficient to share across threads, and implicitly including that there's StringBuilder for more efficient building-up of strings. That design is not stupid.
I certainly did not mean that the benchmark or paper were "not actually that stupid". The benchmark was just ridiculously bad, and the paper utterly stupid. As some other poster said, showing that reallocating a string a million times and appending a single character each time is slower than writing a million characters into a buffer, that's literally a high-school level paper--I think I'd add that it's C-level (haha) high school work.
Re:Check their work or check the summary? by UnderCoverPenguin · 2015-03-26 04:12 · Score: 1

It is not always quicker to do things in memory. Statement proved. QED.

Maybe

The disk-only code, although apparently writing to the disk excessively, is only triggering an actual write when operating system buffers are full.

The "disk-only" code is still writing to memory. If anything is proven, it's that the OS is doing its memory management better than the Java or Python run time environments.
Would be interesting to add another case to compare: Open an unnamed pipe, write to "write end" of the pipe in the loop, then read the final string from the "read end" of the pipe.
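
A minimal Python sketch of that pipe case, with one deviation from the description above: a pipe only holds a limited amount of unread data, so a writer that puts a megabyte into it before anyone reads would block. The sketch therefore drains the read end in a helper thread. `concat_via_pipe` is a made-up name.

```python
import os
import threading


def concat_via_pipe(add_string, num_iter):
    """Build the final string by streaming pieces through an unnamed pipe."""
    read_fd, write_fd = os.pipe()
    chunks = []

    def reader():
        # read() returns everything up to EOF, i.e. until the write end is closed
        with os.fdopen(read_fd, "rb") as r:
            chunks.append(r.read())

    t = threading.Thread(target=reader)
    t.start()
    data = add_string.encode("utf-8")
    with os.fdopen(write_fd, "wb") as w:   # buffered writer around the pipe
        for _ in range(num_iter):
            w.write(data)
    t.join()
    return chunks[0].decode("utf-8")
```

No file is involved at all here, so any speed advantage over naive concatenation would come purely from buffering.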

--
Don't try to out wierd me, three-eyes. I get stranger things than you, free with my breakfast cereal. --Zaphod Beeblebr
Re:Check their work or check the summary? by Trixter · 2015-03-26 07:02 · Score: 1

In other words, the operating system already lessons disk access times.
I guess they don't teach English at this school either.
Re:Check their work or check the summary? by gstoddart · 2015-03-27 12:53 · Score: 1

Well, I'll give you my rule zero for optimizing code ... don't write shitty code relying on more layers of libraries than you can explain what is happening.
My direct experience says most of the people saying "don't optimize" are the ones who wrote the shittiest code in the first place because they simply assume all libraries are fast and efficient.
By the time you've made that shitty and slow code, it's probably too damned late to try to optimize it.
I cut my teeth writing on bare metal, and libraries which were called over and over.
If you don't start with some consideration of what is efficient, and you just do stupid things which rely too much on the library ... no amount of effort later will fix it.

--
Lost at C:>. Found at C.
Re:Check their work or check the summary? by dataminator · 2015-04-08 19:47 · Score: 1

It's really sad that you are the only one who noticed this.
The paper actually makes a pretty good case that you need to be careful with operations that seem cheap but have hidden costs (object allocation), versus others that look expensive but are actually made very cheap behind the scenes (buffering). While this is of course not new, I wouldn't be surprised at all to find this in production code (as they claim), so it's good to raise awareness of the issue.
Also, I was somewhat surprised by the magnitude of the impact. I wouldn't have expected the "disk writes" to be this cheap, or the naive string concatenation to be this expensive, even though the result in general could of course be expected.
Clearly, the authors knew very well what they were doing, and designed the code to illustrate their point. They also explain very clearly what they did and why they get those results, so I really don't see why so many people claim it's deceptive. While not really novel research, I think it's very useful to have this written down so clearly and it's a great resource for new (or even some not-so-new) programmers.

This is the dumbest research I've seen this year by MobyDisk · 2015-03-25 03:53 · Score: 5, Informative

This is the dumbest research I've seen in 2015. There was actually no computation involved -- they just wanted to write a long string to disk. They concluded that adding the superfluous step of concatenating strings in memory, then writing to disk, was slower. Well duh! That's not what memory is for!

Java and Python by mi · 2015-03-25 03:53 · Score: 1

Java and Python versions of the code were written and then run on Windows and Linux systems for comparison. The total time of all writes for disk-only version was compared to total time of in-memory operations plus the disk write of the in-memory approach were then compared.

I fear, this article will be referred to for years to come as "evidence", that in-memory work is slower, while the truth is, Java and Python programs are slower, than the properly-compiled (to machine code) programs. TFA says so too:

these higher level languages are doing a lot of work behind the scenes to handle the concatenation, such as creating new objects and copying the strings in order to accommodate the extra bytes of data.

but few people will read that far down...

It is just "too easy" to write code, that will cause the useless object-creation and destruction in these "higher level" languages — and a human mind can not distinguish between a microsecond and a millisecond, so it all seems to work fine — until you need to do it a million times...

--
In Soviet Washington the swamp drains you.

Re:Java and Python by weilawei · 2015-03-25 10:47 · Score: 1

This is no longer happening in CPython. The naive case is specially handled to have O(n) behavior and is significantly faster than flattening a list of individual strings all at once.

"As price of RAM drops" by mveloso · 2015-03-25 03:54 · Score: 1

The price of ECC ram doesn't drop for years and years.

1MB fits in cache by Anonymous Coward · 2015-03-25 03:54 · Score: 1

Did they forget to flush it?

Re:1MB fits in cache by bobbied · 2015-03-25 04:21 · Score: 1

Did they forget to flush it?
Of course not, it's just their program was so much of a turd that they plugged up the plumbing and the bowl overflowed before they could get the plunger. I'll bet they don't even realize that their "write to disk" likely didn't actually happen until long after the OS cached it and the program got told the write was completed.

--
"File to fit, pound to insert, paint to match" - Aircraft Maintenance 101
Re:1MB fits in cache by itzly · 2015-03-25 04:55 · Score: 2

They forgot to flush their research paper.

Wrong title by Torp · 2015-03-25 03:55 · Score: 1

It's slower in languages with automatic memory management, or with a VM, which is no surprise.
It would be much faster than disk if you wrote the time critical parts in a language designed for, you know, speed...

--
I apologize for the lack of a signature.

Re:Wrong title by BradleyUffner · 2015-03-25 04:07 · Score: 1

It's slower in languages with automatic memory management, or with a VM, which is no surprise.
It would be much faster than disk if you wrote the time critical parts in a language designed for, you know, speed...
In this case it's slower because they are not comparing apples to apples. For memory they repeatedly concatenate strings together, which reallocates the memory and copies the string every time. For the disk they allocate the whole file at once and then just stream the data. It would have been a much better test if they had allocated a memory buffer for the string and streamed the data into it the same way as the disk.
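
In Python, one way to get that apples-to-apples version would be to stream into an in-memory buffer through the same write() interface a file has. This is a sketch using `io.StringIO`; the function name is made up and the paper's code does not do this.

```python
import io


def concat_streamed(add_string, num_iter):
    """Stream pieces into an in-memory text buffer, the way one would into a file."""
    buf = io.StringIO()
    for _ in range(num_iter):
        buf.write(add_string)   # same call shape as writing to a file object
    return buf.getvalue()       # materialize the final string once
```

The loop body is identical to the disk version; only the destination changes.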

only useful to cs 101 students by Anonymous Coward · 2015-03-25 03:55 · Score: 1

Not a good paper. Quadratic string appends are a problem, yes, but a solved one. It's why StringBuilder and [].join exist.

Re:Obvious by amalcolm · 2015-03-25 03:56 · Score: 1

They also seem to be doing buffered writes, so they are not writing each byte to disk, one at a time. No JVM warmup time ... very flawed as benchmarks.

--
Time for bed, said Zebedee - boing

a lot of questions about real-world here. by nimbius · 2015-03-25 04:00 · Score: 1

Generally if you're looking to speed things up in RAM it's not because you're concatenating a group of strings over and over, it's because your overall read time improves dramatically as well. The study also doesn't take into account IO controller overhead... for example the overhead to write to RAM is generally mitigated in Intel chips as the northbridge is merged into the processor and takes advantage of cool things like predictive instructions by the ALU. PERC RAID controllers and HBAs are typically limited by the bandwidth of the bus and the clock speed of the controller on the other hand, as well as any pending rebuilds or cached data they're committing or storing at any given time. JBOD configurations in some RAID cards also require you to build an individual RAID for each disk, meaning the controller could have countless configurations it has to track.

An excellent example of where you want RAM to handle reads and writes is in email antispam. amavis queues get expensive fast, despite optimized perl threading, but cutting this back to spamass-milter and keeping spamassassin in a ramdisk with its compiled ruleset there too means you can handle nearly the entire evaluation of the message without even touching the incoming queue on disk. Issuing rejections at the handshake then greatly improves efficiency over having to issue bounces, which can touch up to 4 queues on disk in some cases.

--
Good people go to bed earlier.

Bad test or is it the headline by ibwolf · 2015-03-25 04:00 · Score: 1

This is a REALLY mind boggling stupid test (or at least headline). Of course it is faster to immediately write stuff to disk as it becomes available, than to build the string in memory and then flush it to disk. Keep the IO bus full while the next write is prepared.

That doesn't change the fact that you should avoid touching the disk as much as possible, it just illustrates that if you must touch the disk, you should try to do it while the processor is busy doing other things (if possible).

Bad code is slower than disk write caching by mveloso · 2015-03-25 04:00 · Score: 1

What they're saying is if you write bad code, it performs like shit. Did someone get a PhD from this?

Re:Bad code is slower than disk write caching by bobbied · 2015-03-25 04:17 · Score: 1

What they're saying is if you write bad code, it performs like shit. Did someone get a PhD from this?
Well, two biology majors did comprise 2/3rds of the contributors to this madness... I sure hope the Electrical and Computer engineer didn't get a PHD for this.. There's no way to defend this with a straight face if you ask me.

--
"File to fit, pound to insert, paint to match" - Aircraft Maintenance 101

Re:Color me surprised by amalcolm · 2015-03-25 04:01 · Score: 1

Nothing wrong with the languages .. lousy benchmarks ... if they had been written in C with the same skill level they would have segfaulted!

--
Time for bed, said Zebedee - boing

That's what you'd call an anti-pattern! by cybervegan · 2015-03-25 04:02 · Score: 1

This:

for i in range(0, numIter):
    concatString = addString + concatString # modified: concatString = concatString + addString

...from the in-memory part of the experiment. Any Python programmer knows you don't do it this way. Strings in Python are immutable, so this re-allocates concatString every time round the loop, most likely causing multiple garbage collection cycles. It's obvious that this is not written by a Pythonista, as it's not "Pythonic" code. No wonder it's slow.

Better (maybe not the best, tho. Call me lame... )

concatList = []
for i in range(numIter):
    concatList.append(addString)  # list.insert() would need an index; append is what's meant
concatString = "".join(concatList)

Re:That's what you'd call an anti-pattern! by cybervegan · 2015-03-25 04:29 · Score: 1

It's even worse - they had to rig the code badly to make their point - and the original is still in the comments!
concatString = addString + concatString # modified: concatString = concatString + addString
Here, they are *prepending* addString to concatString, which means that concatString has to be re-allocated and re-copied every single iteration (EXPENSIVE!)
If you revert the code to the original, it runs faster than the naive straight-to-disk writer. This is due to a CPython optimization that actually extends the memory the string is stored in, and pastes the second string into it, without re-allocating and copying it, thus preventing the quadratic performance of the anti-pattern.
in-memory: String took 0.17798614502, file took 0.00357103347778
disk-only: file took 0.225710868835
When I re-wrote it using a string join, it came out like this:
in-memory: done properly 0.169091939926, file took 0.00345587730408

When publishing an academic paper... by abelenky17 · 2015-03-25 04:03 · Score: 1

one should proof-read enough to avoid grammatical typos in the introduction:

1. Introduction
[...]
Disks, whether mechanical or SSD, have orders ***or*** magnitude higher latency and transfer times

I want to mod this POST down by AttillaTheNun · 2015-03-25 04:04 · Score: 1

for "idiotic premise"

String concatenation by James+Ojaste · 2015-03-25 04:04 · Score: 1

The paper describes using string concatenation in java to prepare the string in memory. In essence, it's comparing an O(1) operation to an O(n) operation and complaining that the latter is slower for large values of n.

Re:String concatenation by 91degrees · 2015-03-25 05:01 · Score: 1

It's not a complaint. It's simply an observation that perceived wisdom isn't always correct.
Re:String concatenation by Jaime2 · 2015-03-25 05:17 · Score: 1

The perceived wisdom is still correct as long as you don't make a grievous implementation error in one of the two things you are comparing. All that they demonstrated is "It is possible to do memory access so badly that it's slower than disk".

Impressive by Minwee · 2015-03-25 04:04 · Score: 1

Unless I have misread the paper, it seems that these folks have just found experimental proof that disk writes are buffered.

"In Java and Python strings are immutable, and any assignment will result in the creation of a new object and possibly copy operations, hence the overhead of the string operations. The disk-only code, although apparently writing to the disk excessively, is only triggering an actual write when operating system buffers are full. In other words, the operating system already lessons disk access times."

I'm guessing that this investigation started with someone making a bet while their thought processes were slightly impaired.
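
The buffering is easy to see from Python, assuming the default buffered text mode and a write smaller than the buffer. The helper name is made up for illustration.

```python
import os


def sizes_before_and_after_flush(path, data):
    """Report the on-disk file size right after write() and again after flush()."""
    with open(path, "w") as f:
        f.write(data)
        before = os.path.getsize(path)  # a small write is typically still in the userspace buffer
        f.flush()
        after = os.path.getsize(path)   # now the OS has the bytes (not necessarily the disk)
    return before, after
```

For a short string the first number is typically 0 and the second is the full length, and that is before the operating system's own write cache gets involved.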

Java Code by Anonymous Coward · 2015-03-25 04:05 · Score: 1

Has anyone looked at the Java code used? (PDF p8 "Appendix 1. Java code")

for (int i = 0; i < numAdd; i++) {
    addString += "1";
}

Why not StringBuilder? No one wants to concatenate (lots of) strings with "+=" in Java because it is not efficient.
Maybe someone should run the test with proper Java string concatenation code and see the results, then you could tell.

Re:Java Code by __Reason__ · 2015-03-25 04:12 · Score: 1

"+=" does use StringBuilder - at least since Java 1.5 or so. But it still, of course, allocates a new String for every iteration of the loop.
Re:Java Code by topology · 2015-03-25 04:44 · Score: 1

Which ultimately should be optimized away by a good compiler. If the string is never read in the context of the loop, leave it in the StringBuilder until the loop is exited or until the StringBuilder content is passed in a function call, then render a String from it.

Ultimately this is a failure on the part of the compiler writers to not handle a very obvious optimization. (obvious to those versed in optimizing loop code as a compiler writer should be).

Re:dumb test? by topology · 2015-03-25 04:06 · Score: 1

That is what they have proved. The problem is they don't even realize that this is exactly what they proved. If they had realized it (and they had any intelligence), they would be far too embarrassed about having spent so much time on this to mention it to anyone. "Look Ma, I proved 1+1 > 1"

Re:This is the dumbest research I've seen this yea by captnjohnny1618 · 2015-03-25 04:07 · Score: 1

Glad I wasn't the only one thinking this. I read the article and thought that I had missed something since their "task" and "conclusions" were so trite.

String concat in java by doctor_shim · 2015-03-25 04:07 · Score: 1

is an expensive in-memory operation, as it is in many other high-level languages. Unsurprising that writing a 1MB string to disk is faster.

Re:Rubbish by __Reason__ · 2015-03-25 04:09 · Score: 1

Stop trying to suppress academic research! Papers are written by scientists, and they deserve respect!

Distinction between in memory & disk meaningle by chaircrusher · 2015-03-25 04:13 · Score: 1

One doesn't simply 'write to hard disk'. You ask the operating system to do that. The OS buffers writes and sends them to the disk in what it thinks is the most efficient blocking. And when you 'write to disk' you really 'ask disk device to write to disk (eventually)'. A hard disk these days is a little computer dedicated to storing data on disk and retrieving it as quickly as possible.

No, It's Not Always Quicker To Do Things In Memory by hcs_$reboot · 2015-03-25 04:13 · Score: 2

No, It's Not Always Quicker To Do Things In Memory

The title ("No, It's Not Always Quicker To Do Things In Memory") should be modded Flamebait, Troll or similar. If it'd be possible.

--
Slashdot, fix the reply notifications... You won't get away with it...

Re:Chemists and Biologists by bobbied · 2015-03-25 04:14 · Score: 1

They have nothing better to do with their time than benchmarking bogus string operations?

In JAVA and Python, no less. Anybody who uses JAVA and Python to draw conclusions about HARDWARE performance is off their rocker out of the gate. Testing the speed difference between memory and disk in Java is problematic and Python is not much better. In this case the problem really is their programming skills though. But what do you expect... Out of the three authors, only ONE isn't a biology major...

--
"File to fit, pound to insert, paint to match" - Aircraft Maintenance 101

Stupid premise - apples and oranges by Dr_Barnowl · 2015-03-25 04:17 · Score: 1

They're only examining the performance of concatenating immutable strings, versus the performance of writing to a (buffered) stream.

This is a problem that's been known about for donkey's ages. It's just that computers are so stupidly powerful it's no longer an issue that many programmers ever have to confront.

In VB6 you had to jump through hoops to do it properly, but it's such a common case in Java that the compiler will optimize repeated concatenations in a loop into using a StringBuilder instead. I presume Python has similar optimizations.

I used to just stick all the strings in an array, allocate a new string of the appropriate length, and copy them into it.
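
That "collect, size, copy once" approach can be sketched in Python as follows. It is an illustration only, using a preallocated `bytearray` to stand in for the new string of the appropriate length; in practice `"".join(...)` does the same job. The function name is made up.

```python
def concat_preallocated(strings):
    """Concatenate by sizing the output once and copying each piece into place."""
    encoded = [s.encode("utf-8") for s in strings]
    total = sum(len(b) for b in encoded)
    out = bytearray(total)            # one allocation of the final size
    pos = 0
    for b in encoded:
        out[pos:pos + len(b)] = b     # same-length slice assignment, copied once
        pos += len(b)
    return out.decode("utf-8")
```

Each input byte is copied into the output exactly once, so the work grows linearly with the total length.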

News at 11 : many less experienced programmers are ignorant of the internal workings of their chosen frameworks, because they never had to write their own implementation at a lower level.

It is easy to write slow code in an application by cloud99 · 2015-03-25 04:18 · Score: 1

There are SO MANY layers of caching between the application layer and the physical disk it is not possible for most applications to know that they are actually writing to disk. This is simply one example of that. Additionally, crappy application layer code runs slowly. Yes, it makes a HUGE difference how you write your code even today in how quickly it executes. High level languages simply make it easier to write code which executes slowly for no apparent reason.

Re:Chemists and Biologists by bluefoxlucid · 2015-03-25 04:18 · Score: 1

It looks like they were discussing methodology performance ("Do everything in memory! Write() is slow!"), not hardware performance.

--
Support my political activism on Patreon.

Re:This is the dumbest research I've seen this yea by c · 2015-03-25 04:18 · Score: 2

Pretty much my thoughts. Writing to disk is slow, but it's also semi-async operation (in that much of the time, the job is offloaded to the I/O subsystem before the write is complete), which generally means the sooner you start writing your results the sooner you'll finish, and if you start early you can do computational work while the I/O is happening rather than spinning wheels while trying to write the whole thing in one go. All they seem to have done is add a pile of latency and may even have introduced other impacts such as garbage collection or VM swap.

--
Log in or piss off.

What the article doesnt say is telling by Big+Hairy+Ian · 2015-03-25 04:18 · Score: 1

Basically the article doesn't give enough detail. It doesn't say whether the strings were created using the base string objects in Java/Python or using the much more efficient StringBuilder objects. The former would be horrendously slow. Also, what was the base setup of the machines being tested on? How much memory did they have? Did their disk controllers have built-in cache? What kind of disks were used?

--

Build a Man a Fire, and He'll Be Warm for a Day. Set a Man on Fire, and He'll Be Warm for the Rest of His Life.

Bad applications and programming languages! by Terje+Mathisen · 2015-03-25 04:22 · Score: 1

What they actually compared wasn't the speed of the disks, but the speed of the language runtime and OS file IO buffering routines!

It wasn't really that surprising that concatenating Java or Python objects can be slower than letting the low-level runtime do the same task.

If they had wanted to test the disk IO speed then they would have had to add at least some fsync() calls; fflush() alone only hands the data to the OS.

It is trivial, in any language, to make your code faster than the actual disk transfer speed, but a lot harder to make it faster than a set of small block moves within (cached) RAM.
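
A minimal Python sketch of the distinction, assuming a POSIX-like system: `flush()` only hands the bytes to the operating system, while `os.fsync()` asks the OS to push them to the device. The helper name `write_and_sync` and the file names are made up for illustration; this is not the paper's code.

```python
import os
import tempfile

def write_and_sync(path, payload, sync):
    """Write payload; if sync is True, also ask the OS to push it to the device."""
    with open(path, "wb") as f:
        f.write(payload)          # may sit in a userspace buffer for now
        f.flush()                 # hands the bytes to the OS (page cache), not the disk
        if sync:
            os.fsync(f.fileno())  # requests that the OS write the data to the device
    return os.path.getsize(path)

with tempfile.TemporaryDirectory() as tmp:
    payload = b"1" * 1_000_000
    size_cached = write_and_sync(os.path.join(tmp, "cached.bin"), payload, sync=False)
    size_synced = write_and_sync(os.path.join(tmp, "synced.bin"), payload, sync=True)
```

Both files end up with the same contents; only the second call waits for the storage stack, so only a benchmark built around it would say anything about the disk.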

Terje

--
"almost all programming can be viewed as an exercise in caching"

They are not using StringBuffer or StringBuilder by RockGrumbler · 2015-03-25 04:23 · Score: 1

One of the first optimizations you learn when writing Java in a moderate load environment is to use StringBuffer or StringBuilder when concatenating Strings. There is probably a similar construct in Python. The test was either not written from a place of experience or was purposely constructed this way to prove their preassumed point.
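
For what it's worth, the usual Python counterparts are collecting pieces in a list and calling `str.join`, or writing into an `io.StringIO` buffer. A small sketch (the function names are invented for illustration):

```python
import io

def build_with_join(piece, count):
    """Collect pieces in a list and flatten once at the end."""
    parts = []
    for _ in range(count):
        parts.append(piece)
    return "".join(parts)

def build_with_stringio(piece, count):
    """Append into an in-memory text buffer, roughly what StringBuilder is used for."""
    buf = io.StringIO()
    for _ in range(count):
        buf.write(piece)
    return buf.getvalue()

result_join = build_with_join("1", 100_000)
result_sio = build_with_stringio("1", 100_000)
```

Either way the final string is assembled once instead of being rebuilt on every iteration.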

Wow this was a waste of paper... by aethelrick · 2015-03-25 04:24 · Score: 1

The research tells us that repeatedly concatenating strings together is a bad thing... WE ALREADY KNOW THIS!!! good grief, who taught these guys to code? The title of their paper "When In-Memory Computing is Slower than Heavy Disk Usage" implies heavy disk access where none exists. They actually go on to point out that it's the OS doing magic things that helps out. i.e. it's the OS using RAM to buffer the disk that keeps your app speedy. So erm... memory being used instead of disk then... the exact opposite of their claims

misleading by theendlessnow · 2015-03-25 04:24 · Score: 1

I'd argue it's always faster to do things in memory. In the case presented here they were *not*. In both cases being compared they were writing to disk. All they did was determine the better way (for their case) to write to disk.

Re:Color me surprised by greg1104 · 2015-03-25 04:25 · Score: 1

And have a trivial exploit as a result too. It's a good thing that people who write bad Python and Java code are using those languages.

Re:This is the dumbest research I've seen this yea by Lunix+Nutcase · 2015-03-25 04:26 · Score: 1

It's even more simple than that. Their "writes to disk" are just being stored in disk cache hence the "faster" speed. On the other hand, they do basically the most inefficient in-memory operations possible.

Re:Obvious by freak0fnature · 2015-03-25 04:28 · Score: 1

Agreed. It doesn't even mention using StringBuffers or anything else that is designed to increase the performance of String concatenations, or even simply using a 1MB byte or char array. Not only did they fail in their programming, they failed to understand that the disk has its own memory, and their 1MB string is small enough to fit there before being physically stored on disk.

Re:bogus 'article' by bobbied · 2015-03-25 04:28 · Score: 1

1. arXiv paper - not peer reviewed
2. authors never mention caching, buffers, any kind of actual technical details
3. for the Java code they use 'BufferedWriter' ... oh I wonder where their 1MB of data is going to
4. plots done in MS Office
=> the paper is complete and utter crap and would not pass muster with any reviewer on any C-rated conference or workshop

You forgot to add that two of the three contributors are BIOLOGY majors... What are they doing writing Java code for an academic paper?

--
"File to fit, pound to insert, paint to match" - Aircraft Maintenance 101

Re:O(n^2) in memory slower than O(n) writes to dis by grimmjeeper · 2015-03-25 04:29 · Score: 1

In other news, Pope found to actually be Catholic. The Pontiff was quoted as saying, "I always knew I was Catholic from when I was a little boy."

This just in. Massive government study shows bears do defecate in the woods. Head of the $65M (£43.6M) government funded study, Dr. Hans Schmidt, describes the study. "Ve always knew ze bears did zeir business somevere but ve were never sure vere zey did it. But now it is confirmed. Zey do zeir business in ze woods."

I skimmed the paper and they're morons. by hey! · 2015-03-25 04:34 · Score: 1

Here's the relevant bit:

long startTime = System.currentTimeMillis(); for (int i=0; i

So if numIter is one million, they're generating and throwing away a million temporary objects, some of them quite large. No competent Java programmer would write a tight loop this way.

--
Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.

The experiment does not test the postulate by QuietLagoon · 2015-03-25 04:35 · Score: 1

In both cases of the experiment, the disk is used, therefore the experiment cannot test whether doing something in memory is faster than doing something on disk.

It's a flawed experiment.

Re:Rubbish by topology · 2015-03-25 04:39 · Score: 1

Assuming your post was not so tongue-in-cheek that you lacerated your cheek,

"Scientists" deserve respect only if they do good science. Shitty science means they are shitty scientists and deserve to be repudiated.

VFS by goarilla · 2015-03-25 04:42 · Score: 1

I hope this is not just the kernel VFS and cache doing a better job with their memory management than the researchers' code.

They went full retard. by duhorg · 2015-03-25 04:45 · Score: 2

Yep, many commenters here got it right: the "study's" authors are doing teh-stupid operations "in memory". This one is so egregious, especially that the ITworld article author fell for it, that I felt it warranted its own dissection post: http://blog.duh.org/2015/03/wh...

Bogus by Anonymous Coward · 2015-03-25 04:55 · Score: 1

Their Python results are bogus. After I fix the n^2 string algorithm I get these results:
in-memory: String took 0.0522561073303, file took 0.000552892684937
disk-only: file took 0.184334993362

Before the fix I get these results:
in-memory: String took 93.5676689148, file took 0.000598907470703
disk-only: file took 0.180480003357

Don't do stupid, incompetent, bone-headed things and your program won't be slow.

I changed:
for i in range(0, numIter):
    concatString = addString + concatString
To:
concatString = ''.join( addString for i in xrange( 0, numIter ) )

Of course you can do better:
concatString = addString * numIter
Produces:
in-memory: String took 0.000266075134277, file took 0.000687122344971
disk-only: file took 0.194098949432

Stupid is as stupid publishes.... by TiggertheMad · 2015-03-25 04:58 · Score: 5, Insightful

I just scanned the paper, because their claim seems to be idiotic. It looks like they are appending a single byte on the end of a string in memory and on disk. For the memory operation, this will result in a string copy since strings are immutable, vs. doing a one byte file append onto the disk. The former is increasingly expensive and the latter is a fixed cost, so after enough operations, the disk cost becomes far less than the memory operation. If this is indeed their claim, and I am not missing something, then they should be collectively slapped for wasting our time by writing this paper. If this is really your use case, write some proper data structures to manage your data in a sane fashion.

So yes, if you do stupid things, you can make bad engineering decisions look like good ones.
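
To put rough numbers on "increasingly expensive" versus "fixed cost", here is a back-of-the-envelope cost model in Python. It assumes every append to an immutable string copies the whole result (1 + 2 + ... + n bytes in total), which is a simplification: real runtimes sometimes avoid some of those copies. The function names are made up for illustration.

```python
def bytes_copied_by_naive_append(n):
    """Total bytes copied if the i-th one-byte append builds a new i-byte string."""
    return n * (n + 1) // 2

def bytes_handed_to_file(n):
    """Total bytes passed to the OS by n one-byte file appends."""
    return n

# Cross-check the closed form against an explicit loop for a small case.
loop_total = sum(length for length in range(1, 1001))
naive_small = bytes_copied_by_naive_append(1000)
file_small = bytes_handed_to_file(1000)
naive_million = bytes_copied_by_naive_append(1_000_000)
```

Under this model a million one-byte appends copy on the order of 5 x 10^11 bytes in memory, against 10^6 bytes handed to the file layer.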

--

HA! I just wasted some of your bandwidth with a frivolous sig!

Re:Stupid is as stupid publishes.... by Bengie · 2015-03-25 05:43 · Score: 3, Insightful

They should follow best practices and use StringBuilder and rerun their tests.
Re:Stupid is as stupid publishes.... by Dr_Barnowl · 2015-03-25 06:03 · Score: 1

It's such a common case that Java will usually optimize looped string concat into a StringBuilder now. I imagine Python does the same thing.
The performance probably still sucks because the buffer needs re-allocating periodically (in StringBuilder, it doubles each time), but not as much as it would in VB6 which has no such cleverness in its immutable string type. There's also all the garbage collection overhead for all those discarded byte arrays.
Would be more interested to see a benchmark if they declared the StringBuilder with a buffer of the size they expected to use.
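
A quick Python simulation of that growth policy, as a sketch only: it assumes plain doubling from a starting capacity of 16, which approximates but does not exactly reproduce what a real StringBuilder does, and `simulate_doubling` is an invented name.

```python
def simulate_doubling(n_appends, initial_capacity=16):
    """Count reallocations and bytes moved for n one-byte appends into a doubling buffer."""
    capacity = initial_capacity
    length = 0
    reallocations = 0
    bytes_copied = 0
    for _ in range(n_appends):
        if length == capacity:
            bytes_copied += length  # existing contents move to the new buffer
            capacity *= 2
            reallocations += 1
        length += 1
    return reallocations, bytes_copied

growing = simulate_doubling(1_000_000)
presized = simulate_doubling(1_000_000, initial_capacity=1_000_000)
```

With doubling, a million appends need only a handful of reallocations and move roughly as many bytes as the final string holds; a buffer sized up front needs none.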
Re:Stupid is as stupid publishes.... by Mark+of+the+North · 2015-03-25 06:45 · Score: 1

I've actually worked with one of the authors. Nice guy. The analysis in the paper is so shallow that my guess is that it was primarily done by a graduate student, probably as part of a course but never successfully peer reviewed. I sure hope it wasn't accepted anywhere.
Re:Stupid is as stupid publishes.... by msobkow · 2015-03-25 06:53 · Score: 2

Java's "StringBuffer" object can deal with concatenating source code fragments to produce 6 million lines of code in under 8 minutes and write it to a 7200rpm HDD on Linux. Java handles string concatenation quite efficiently if you're using the proper data objects instead of naively doing actual string concatenations that require much more buffer re-allocation than simply extending the end point of a buffer that is periodically reallocated with n extra bytes each time. And that's only on a creaky old P4 3.8GHz with DDR2-800 memory.
I call "bullshit" on the paper.

--
I do not fail; I succeed at finding out what does not work.
Re:Stupid is as stupid publishes.... by Dastardly · 2015-03-25 07:00 · Score: 3, Informative

It is even worse than that. They are using a BufferedWriter for the so called writing to disk portion. So, they are actually comparing the worst possible way to append a String in memory to appending bytes to a bytes buffer and periodically writing that to disk. So, basically comparing two different in memory string appending techniques? When you bring the OS into play it is even less likely to actually show anything having to do with the disk because the OS will write asynchronously.
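
The same effect is easy to see in Python with `io.BufferedWriter`. The sketch below wraps a made-up in-memory sink (`CountingRaw`, not a real library class) that counts how often the buffered layer actually calls down; the buffer size of 8192 is an arbitrary choice.

```python
import io

class CountingRaw(io.RawIOBase):
    """Hypothetical sink standing in for the OS: records each underlying write call."""
    def __init__(self):
        super().__init__()
        self.calls = 0
        self.data = bytearray()

    def writable(self):
        return True

    def write(self, b):
        self.calls += 1
        self.data += bytes(b)
        return len(b)

raw = CountingRaw()
buffered = io.BufferedWriter(raw, buffer_size=8192)
for _ in range(100_000):
    buffered.write(b"1")  # appended to the in-memory buffer most of the time
buffered.flush()          # push whatever is still buffered down to the sink
raw_calls = raw.calls
```

A hundred thousand one-byte "writes" turn into a few underlying calls, so the loop is mostly an efficient in-memory append.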
My grade: F- and they should be mocked mercilessly until the paper is retracted for being idiotic.
Re:Stupid is as stupid publishes.... by ndykman · 2015-03-25 07:18 · Score: 2

I strongly encourage people to email the authors and clue them in. Seriously, this makes me angry. As if CS didn't already have a reputation for being completely academic and out of touch. With things like this, no wonder people think you can learn to code in 10 weeks.
Re:Stupid is as stupid publishes.... by PIBM · 2015-03-25 07:33 · Score: 1

They chose to use a 2006 version of Java so that it would not be optimized away, perhaps? It looks like a troll article to me.
Re:Stupid is as stupid publishes.... by BarbaraHudson · 2015-03-25 07:34 · Score: 1

The other issue is that the OS and disk cache will buffer those one-byte writes, whereas they went out of the way to use the worst code possible for in-memory operations. Appending 1 byte a million times creates and destroys 1 million instances. If they had just created an array and written each byte at the appropriate offset, there's only one instance. A lot faster.
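
In Python terms, the "one array, write at the offset" approach could look like the sketch below. It assumes ASCII pieces, and `fill_preallocated` is an invented name.

```python
def fill_preallocated(piece, count):
    """Allocate the full buffer once, then copy each piece to its offset."""
    piece_bytes = piece.encode("ascii")
    step = len(piece_bytes)
    buf = bytearray(step * count)  # single allocation up front
    for i in range(count):
        offset = i * step
        buf[offset:offset + step] = piece_bytes  # same-size slice assignment, no resize
    return bytes(buf)

filled = fill_preallocated("1", 100_000)
```

Only one buffer is ever live during the loop, so there is nothing for the garbage collector to chase.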

--
"Transparent" is a shit show that trades on every stereotype going. A man in drag is NOT a transsexual.
Re:Stupid is as stupid publishes.... by Darinbob · 2015-03-25 07:51 · Score: 1

Trouble is, people are taught things but they don't actually learn. Instead they take what they are taught as "the truth" and never actually think about what's going on. Rules of thumb become dogma.
This isn't just from newer schools who only teach job training courses for high level scripting languages, this effect also happened even when schools taught from the low level chip design all the way up through theory.
Re:Stupid is as stupid publishes.... by Mark+of+the+North · 2015-03-25 11:11 · Score: 1

None of the authors have a CS background. Two are products of Electrical Engineering, one of whom is a professor (whom I worked with at one point). The other is from the Department of Biological and Chemical Engineering.

True for certain database operations by avandesande · 2015-03-25 05:03 · Score: 1

In SQL Server, large memory table objects can be slower to join with or access than temp tables due to lack of indexing. I am sure you could find other cases where memory might perform worse than file access.

--
love is just extroverted narcissism

What morons by m.dillon · 2015-03-25 05:06 · Score: 1

What morons. Sorry, but they are. They are writing to a file through the operating system which means that it is being spooled out to disk asynchronously, so obviously piecemeal writes are going to be faster because they will run concurrently with the string generation algorithm. Plus their 'writes' are probably being buffered in ram anyway.

Writes to files generally do not stall programs. These people are morons.

-Matt

python and java by Spazmania · 2015-03-25 05:10 · Score: 4, Informative

They tested using strings in Python and Java, both of whose string libraries are very much overweight. And they tested by concatenating strings in a way that requires constant reallocations and memory copies versus pushing data to fixed size disk buffers in the OS cache.

So... surprise! When writing data sequentially the C implementation of disk buffers is faster than the Java and Python implementations of strings.

--
Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.

Re:python and java by Just+Some+Guy · 2015-03-25 08:26 · Score: 1

Python's string library isn't remotely what I'd call "overweight", but its strings are immutable. Some algorithms that are quick in other languages are slow in Python, and some operations that are risky in other languages (like using strings for hash keys) are trivial (and threadsafe) in Python. But regardless of the language involved, it's always a good idea to have a bare minimum of knowledge about it before you do something completely stupid.

--
Dewey, what part of this looks like authorities should be involved?

Does Slashdot have editors? by Afty0r · 2015-03-25 05:10 · Score: 1

Can we reject things like this please?

Or possibly have a "-1 written by an idiot" mod option, and enough of them removes it from the front page?

Painful dribble- but expect a LOT more of this by Anonymous Coward · 2015-03-25 05:10 · Score: 1

Now that having a vagina is a legal NECESSITY for any coder receiving funding from a US state facility (or from a major politically correct IT company), you can expect an awful lot more GARBAGE flooding the annals of Computer Science.

I am reminded of the sickening NASA press conference, that announced a totally new form of life on Earth. A BRAIN-DEAD female, employed by NASA purely because she was female, had been given a team of mostly male science slaves, to pursue her NONSENSE theory that because she remembered the stupid simplification of the Periodic Table during her High School years (when teachers over-state the similarity of elements in the same 'group') DNA variants must exist in nature where one element of usual DNA would be replaced with ANOTHER element from the same group. This is a kin to saying that because a water-wheel is technically related to a gear, you might expect to find water-wheels within a mechanical watch. The 'science' she had used to 'prove' her hypothesis was the most humiliatingly awful pile of garbage, but her male colleagues had been forced by the upper echelons of NASA to claim her a 'genius'.

When ability is replaced by political correctness, and an anti-nerd program is implemented in most US schools and universities, standards don't just fall through the floor, but one enters an 'Emperor's New Clothes' phase in the USA, where anyone who wishes for any form of success MUST lavish praise on the ravings of cretins that the politically correct establishment is currently presenting as figure-heads.

Now this isn't an anti-female point. To the contrary, because of the bias towards males in engineering, the VAST majority of incompetents I have encountered have been males- some with significant power and/or influence. However, this is common in any system (scum rises) - and is different from a system FORCED to bias itself to selling one type of Human over another.

Bill Gates and Rupert Murdoch's 'Common Core' (don't forget, these two monsters- supposedly from either end of the political spectrum- are actually CLOSE friends- and work together on the 'new' education projects in the USA, like the now at the NSA 'inBloom' obscenity) initiative is already dumbing-down education in the USA, and placing maximum hurt on the male Beta experience in technical subjects- especially maths. Put simply, Common Core is designed to ELIMINATE the influence of natural Alphas in the classroom, and place 100% of the intellectual authority in the hands of teachers - adults that frequently have such poor skills in their 'subject', they actually fall BELOW the Beta designation. The underlying idea is that non-alpha kids will have too little innate confidence in their own skills at that age to challenge the system, and that the alphas will 'withdraw' and simply succeed with ex-classroom activity.

PS as for the hopeless article promoted by Dice (no accident there), the 1MB data set hits the common L2 cache limit by no co-incidence. The concept of 'memory' on a modern computer is a complex chain of 'L' caches, where the RAM of the system is actually L4 cache, the SSD drive should be used as L5 (but rarely is), which would make the hard-drive L6. Normally, under a properly designed memory-management system, one wouldn't be explicitly writing anything to the hard-drive UNLESS it was to create or alter a permanent HARD-file.

The speed of a modern CPU cluster (4+ cores in a modern work-horse PC) is so much faster than the data rate to a HD physical platter (as opposed to the RAM cache of the HD), it isn't funny. But usually when doing explicit HD writes, one accepts that the thread is going to be limited to the speed of communication to the HD. That thread, of course, will not even dominate the run-time of its own core since each IO stall will allow another thread to take its place until the HD has finished its current DMA activity.

The computer is NOT a black box, and coders who treat it as such, because their knowledge comes from politically correct teaching, havin

HOT BREAKING NEWS! by Alsee · 2015-03-25 05:12 · Score: 5, Funny

NEW SCIENTIFIC DISCOVERY!
For n equal to one million, an O(n^2) algorithm is slower than an O(n) algorithm. Even when the O(n^2) algorithm is run in RAM, and the O(n) algorithm is disk writes being buffered and optimized by the operating system.

I'll take my Nobel Prize now, thank you.

-

--
- - You can't take something off the Internet! That's like trying to take pee out of a swimming pool.

But file writes are buffered! by ggraham412 · 2015-03-25 05:12 · Score: 1

So in a nutshell, they compared the cost of the most inefficient method of in-memory string concatenation versus an efficient method of in-memory string concatenation: the memory in the latter case being a buffered file writer. Lololol!!!

Re:This is the dumbest research I've seen this yea by Jaime2 · 2015-03-25 05:13 · Score: 2

It's dumber than that. They didn't even do it right in Java. There is a note near the end of the paper that says "However, using a mutable data type such as StringBuilder or StringBuffer dramatically improved the results". They didn't present the numbers, but what they really meant was "The performance problems we saw were entirely due to our not using StringBuilder or StringBuffer, this paper shows no meaningful difference in performance between memory-then-disk and disk-only access once the algorithm is fixed."

This is RESEARCH? by orlanz · 2015-03-25 05:21 · Score: 2

Ok, I read all the other "This is stupid" comments and my jaw kept dropping. I actually felt this was an April fools thing or something similar and that we were all missing something somewhere (and please let me know if I am... I REALLY need to know). I HAD to read the article and underlying paper, cause I just couldn't believe the absolute asinine stupidity of the test, let alone that it was being presented as research, or that the test itself was so flawed! So after all that, had to post. Summary for others, adding my voice to the crowd.

----------------
Assumption: Software Developers avoid disk access cause they believe doing it in memory is faster. This is put in context of BI and bigdata.

Testing: Create a program representing a common task that can be tested where one uses memory and the other uses diskspace.
Memory Test:
1) Create a string in memory.
2) Add it multiple times into another string
3) Write second string onto Disk
4) Flush writes

Disk Test:
1) Create a string in memory
2) Write it multiple times to Disk
3) Flush writes

Create code in Python and Java.

Conclusion: Memory Test is so much slower than Disk Test! Additionally, the languages used have certain quirks to make it worse. Optimization helped a little but only on Linux. Therefore, programmers should reassess and understand their OS and programming languages before assuming this belief which is not true.
---------------
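
For concreteness, the two procedures summarized above might be sketched in Python like this. It is a reconstruction from the step lists, not the paper's code; the function names, file names and the small iteration count are assumptions.

```python
import os
import tempfile

def memory_test(path, add_string, num_iter):
    """Build the whole string in memory first, then write it once and flush."""
    concat = ""
    for _ in range(num_iter):
        concat = add_string + concat  # naive concatenation, as criticized in the thread
    with open(path, "w") as f:
        f.write(concat)
        f.flush()

def disk_test(path, add_string, num_iter):
    """Write each piece straight to the (buffered) file, then flush."""
    with open(path, "w") as f:
        for _ in range(num_iter):
            f.write(add_string)
        f.flush()

with tempfile.TemporaryDirectory() as tmp:
    mem_path = os.path.join(tmp, "memory.txt")
    disk_path = os.path.join(tmp, "disk.txt")
    memory_test(mem_path, "1", 10_000)
    disk_test(disk_path, "1", 10_000)
    with open(mem_path) as f:
        mem_out = f.read()
    with open(disk_path) as f:
        disk_out = f.read()
```

Both variants produce the same file and both go through the same buffered write path; the only real difference is the extra string-building loop in the first one.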

Assumption & Testing idea... very good. I would have loved to know the unknown scenarios where this assumption should be questioned. Especially in the world of click&drag programming for workflows, ETLs, and report writing.

But from there... it's all BS and stupidity. Basically the test tests if replicating the hard drive driver in memory and then using the driver to write to disk is faster than just using the driver to write to disk. Are you bloody serious?!?! That's like testing if 2+2 is greater than 2+0. And that is before we start looking at using Java and Python which do a ton of work in terms of memory management and build all types of stuff around data types. Before the fact that they wrote the Python code WRONG (that's the slow way of doing string or list concat). So they picked languages that write in memory O(n) extra times for the same data.

This test would have come to the same conclusions in C, C++, or Assembly! But the folks wouldn't have been able to write code to see the micro second time differences.

So let's set the record straight. NO developer out there goes out of their way to just write to a memory file if it's simply going to flush to disk. It's not worth the extra lines of code, nor the lost CPU cycles in reading them. Especially since most operating systems do this already at multiple points along the data chain at the very low hardware & driver levels! If we have developers like this, we have a ton of bigger problems in software development than this little thing that will be solved by money.

To test this belief properly, give me a scenario where you reuse the written to disk/memory stuff, transform it, and then write to disk. See which one is slower. If it's written properly, you will see that the underlying hardware systems will actually store stuff in cache or memory for you to help you speed it up! If you find proper scenarios where the memory part is slower, please let us know cause that is actually adding to the IT body of knowledge.

God, as this was BigData related, I was hoping at least something along the lines of "In DB data processing and extract vs extract and client side processing". Give me the points along a curve where one is better/worse than the other. THAT would have been interesting.

Depends on the situation by MagickalMyst · 2015-03-25 05:26 · Score: 1

"...alternative ways to create a 1MB string and write it to disk"

This is not surprising if the goal is to write something to the disk.

But what if you were to write something to the screen instead?

Would it be any faster to create a 1MB string on the disk and then display it on the screen? Probably not.

Not to mention that writing to the screen really is writing to memory; as opposed to a disk which is a slower, physical medium.

So really, like anything it depends on what you are trying to do.

It may not always be quicker to do things in memory, but it usually is.

--
Political correctness is really just herd psychology pushed by insecure people who desperately seek social conformity.

Strange by nospam007 · 2015-03-25 05:26 · Score: 1

Can't be.
I just ran a dBASE III test with the first hard disk I ever got, a full height 20MB Seagate, and memory always won.

Re:This is the dumbest research I've seen this yea by falzer · 2015-03-25 05:26 · Score: 1

>This is the dumbest research I've seen in 2015. There was actually no computation involved -- they just wanted to write a long string to disk. They concluded that adding the superfluous step of concatenating strings in memory, then writing to disk, was slower. Well duh! That's not what memory is for!

Agreed with you on the uselessness of their research, but that is most definitely one important and common use of memory: buffer caches used by the operating system.

Effectively, they unintentionally tested the speed of the OS to concatenate strings vs Java or Python. The researchers are wrong right out of the gate: they say "Heavy Disk Usage" in their research headline, but at no point did they actually test disk performance, everything they did is being handled by the OS buffer cache.

All the researchers have shown is that string concatenation operations in Java and Python are atrociously slow. The Java example used the naive form a=a+b; to concatenate strings, which is one of the slowest ways to do it in Java if you are doing repeated concatenations to a string.

If, in their tests, they had also done a string concatenation in C by allocating a buffer and appending to it using a pointer (not strcat) the speed difference doing that vs. 1 million write calls would have been negligible.

Also, if they sync'd after each of a million 1-byte writes to test how slow "Heavy Disk Usage" is compared to a single write of a million bytes, they wouldn't have bothered finishing this paper at all because it's so damn obvious that memory is faster.

We're all doing it wrong! by jetkust · 2015-03-25 05:29 · Score: 5, Funny

Maybe we should store our files in memory and load them into the harddrive to do calculations.

Re:We're all doing it wrong! by hcs_$reboot · 2015-03-25 21:00 · Score: 1

Maybe we should store our files in memory and load them into the harddrive to do calculations.
And the swap would be also in memory... What a fast swap we would get!

--
Slashdot, fix the reply notifications... You won't get away with it...

Re:This is the dumbest research I've seen this yea by MerlynEmrys67 · 2015-03-25 05:32 · Score: 1

What do you expect... There is only one person that MIGHT have a computer background on the paper... This is pure academic fluff, compare an in-memory database with a spinning rust based database and see how many operations you can get out of each one performing the same operation.

--
I have mod points and I am not afraid to use them

No way, unless extreme incompetence is employed by gweihir · 2015-03-25 05:34 · Score: 1

Unless you really mess up your in-memory variant, there is no way in this universe that disk access can be faster. Here is a simple thought-experiment that shows this: Just use an in-memory array as storage vs. the disk. The array must be faster as RAM has orders of magnitude better bandwidth and latency than even the fastest SSDs, and it has far, far smaller block-sizes in addition.

Of course, if your in-memory data storage is so badly organized that the OS in-memory (!) buffer-cache does a better job, you may think that you observe the disk being faster than your RAM.

Looking at the paper header, the authors are all from a biology-department, so the suspicion that they are clueless of how to write things efficiently is really not far-fetched. I will not read the paper, it is likely a waste of time.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.

Re:No way, unless extreme incompetence is employed by viperidaenz · 2015-03-25 07:56 · Score: 1

Don't forget that to write to disk, you must store the data in memory first.
Re:No way, unless extreme incompetence is employed by gweihir · 2015-03-25 19:32 · Score: 1

Indeed.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.

Wow who would have thunk it!? by nedlohs · 2015-03-25 05:44 · Score: 1

Writing lots of small strings to disk one by one, is faster than building an in memory string up from those small strings one by one - reallocating a larger chunk of memory and copying the entire current string on each little step.

It's almost like memory allocation and copying data over and over might slow things down a bit over simply not doing that at all.

Honestly the paper is completely retarded. Only a moron would not expect that result.

Caching explains much of the difference by davidwr · 2015-03-25 05:47 · Score: 1

The results were poisoned by the presence of various caches affecting disk I/O and for that matter memory I/O. On some modern systems, either the disk lies to the computer or the OS lies to the application and the application thinks the data is actually stored on the bare metal before it is really stored (the data may or may not be stored in a "safe" place like a non-volatile cache - the point is that a small write operation returns "success" very quickly, much faster than if it had to wait for the bits to be written to the platter).

The only thing they can really say is "on this hardware, using this operating system, under this workload, these are the results of our experiments."

I'm not saying their results aren't useful - they are. Instead of presenting this as "disk writes can be faster than memory writes" they should say "in some or many modern systems, under some circumstances, it may be more efficient for programs or operating systems to write to external storage devices in small bits rather than going to extra work to minimize the number of writes to such devices. Don't assume that what was true about the performance of an application calling an operating system to perform a disk-write operation or of an operating system asking a hard drive to perform a disk-write operation is the same now as it was a decade or two ago."

Just don't call them "disk writes." Call them what they are - "requests by the application or the OS to the OS or hardware to perform a disk write."

--
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.

Is this an onion article? by Anonymous Coward · 2015-03-25 05:56 · Score: 1

Non peer-reviewed research should be published in The Onion; at least this way we all know it's a joke, and news sites would stop posting this garbage as real research. The summary of this paper is you can write terrible code that makes in-memory look worse than disk because you don't really write to disk, the OS does, and you don't know how to actually measure in-memory vs disk access.

Re:Chemists and Biologists by Rakarra · 2015-03-25 06:00 · Score: 1

The problem I have with their methodology performance is that it seems like they came to a conclusion first, then wrote a test that would support that conclusion. We might roll our eyes and move on, but that's a poor way to conduct biology and chemistry research as well.

I'm actually happy this story was posted to Slashdot, because it has the side effect of illustrating a number of cautionary tales about how to do things, which I don't think the submitter or editor had in mind when approving it.

Don't over-optimize by Immerman · 2015-03-25 06:00 · Score: 1

Correction: Dumb concatenation is for situations where clarity and convenience are more important than performance. That is to say, most of your code. If you're using a buffered string building system to perform a simple concatenation of file name components before opening the file to begin the real work, then you're creating maintenance headaches for an irrelevant performance boost. In an inner loop dedicated to string manipulation though...

One of the most important aspects of programming is knowing how to choose the right tool for the job - typically 80-90% of your codebase is executed so infrequently that even abysmal performance won't be noticeable to the user, and optimizing that code will actually be counter-productive due to the time and maintenance overhead it imposes. Moreover, most programmers (and especially the inexperienced) are quite poor at recognizing which parts of their code are actually performance critical. Recognizing that, it may often make sense to default to doing things the simple, slow, way and only worry about optimization after performance analysis has highlighted the bottlenecks. Personally I'll often write the expected hotspots with some basic "low-cost" optimizations (like StringBuilder) and an eye towards making future optimization simple (expected optimizations should never require an architecture overhaul, and maybe it's worth passing in certain data that will be necessary for future optimization of a heavily used function, just to avoid having to re-write a million function calls if/when it comes time to optimize it), but primarily focus on correctness. Then, after everything is working correctly, and performance analysis has shown me which bits of code are actually causing issues (including the bits I never would have suspected), then I go back and break out the heavy guns. I've gotten burned too many times over-optimizing the wrong things and generating headaches down the road to make a habit out of it anymore. Unless it's something fun to implement of course...

--
--- Most topics have many sides worth arguing, allow me to take one opposite you.

What an Idiotic Paper by rrr00bb5454 · 2015-03-25 06:06 · Score: 1

Holding the string in memory serves no purpose at all if you are just appending to it. Frankly, this += of strings issue is the most common "Smart but Green Self-Taught" versus "Computer Sci grad" problem you will see with new hires. Appending strings can be O(n^2) when the strings are immutable, and it applies to most high level environments. Even Metasploit had this issue at one point, and it was written by some very smart people. So everybody learns to keep appending to a list, and then flatten it to a string at the end. But this tie in with disk just makes the paper totally dumb. If you won't be reading the queue of string chunks, just flush them to disk immediately so that the code runs in constant space - relieving the memory allocator.
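
A minimal Java sketch of the append-to-a-list-then-flatten idiom mentioned above. The class and method names (ChunkJoin, joinChunks) are made up for illustration and are not from the paper; the only library call relied on is String.join, available since Java 8.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: ChunkJoin and joinChunks are hypothetical names.
class ChunkJoin {
    // Collect the chunks in a list, then flatten them to one String at the end.
    static String joinChunks(String chunk, int count) {
        List<String> parts = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            parts.add(chunk);          // stores a reference; no characters are copied here
        }
        return String.join("", parts); // one flattening step over all the chunks
    }

    public static void main(String[] args) {
        String s = joinChunks("1", 1_000_000);
        System.out.println(s.length()); // prints 1000000
    }
}
```

If nothing ever reads the list back, the same loop could hand each chunk straight to a buffered writer instead, which is the constant-space variant described above.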

Repeated Java string concatenation is O(n^2)! by FizzyP · 2015-03-25 06:11 · Score: 2

Concatenating strings one character at a time in Java has QUADRATIC performance (i.e. O(n^2)). If they used the StringBuilder class instead I bet most of their bottlenecks would disappear. With that class it should be amortized O(n).

Re:Repeated Java string concatenation is O(n^2)! by FizzyP · 2015-03-25 06:14 · Score: 1

Correction: StringBuilder performance would be amortized O(n log n).

Re:Chemists and Biologists by Immerman · 2015-03-25 06:18 · Score: 1

Sure, except that had they used buffered string concatenation, so that the implementation was actually remotely comparable to the buffered disk access, they would have gotten radically different results.

As it is, they've only actually shown that doing things in basically the slowest possible way in memory can be slower than writing to the disk cache (also entirely in memory; they never initiated an actual write to disk at all, only queued it).

--
--- Most topics have many sides worth arguing, allow me to take one opposite you.

Your code sucks by viperidaenz · 2015-03-25 06:21 · Score: 2

String concatString = "";
for (int i = 0; i < numIter; i++) {
    concatString += addString;
}

That's going to create 1,000,000 StringBuilder objects, use them to append a single String each, and allocate 1,000,000 new String objects as well

StringBuilder builder = new StringBuilder();
for (int i = 0; i < numIter; i++) {
    builder.append(addString);
}
String concatString = builder.toString();

I bet $1,000,000 that code is faster.

tl;dr: Researchers who don't know how Java works suck at writing Java benchmarks.
String a = b + c;
gets translated by the compiler to something like:
String a = new StringBuilder(b).append(c).toString();

It's creating a new StringBuilder object, its member variables including a char array, it copies the String passed in to the constructor. Append is probably also expanding the array, which means creating a new array and copying the old one to the new one, then copying the data from b to the end of the new array.
toString then creates a new String object, copying the data again.

If you write shit code, you get shit performance.

Re:Your code sucks by t-wata · 2015-03-28 01:52 · Score: 1

Also, he uses a BufferedWriter for the "write to disk" test. If he used a plain FileWriter and your code, he would get the opposite result.
Re:Your code sucks by viperidaenz · 2015-03-29 06:25 · Score: 1

Depends. That may only lower the overhead associated with copying the data from the user process over to the kernel. The kernel may still buffer the writes to disk.

This illustrates why PhD's shouldn't be in busines by Tony+Isaac · 2015-03-25 06:29 · Score: 1

Of course, there are exceptions. But many PhD's I've known make lousy programmers, in terms of producing good software.

I've come to think that the skills needed to be a good post-graduate student are different from the skills needed to be a good professional developer.

Professional developers know (or should know) how to optimize code, when necessary. All else being equal, optimized code will ALWAYS be faster in memory than on disk. The two examples in this research are NOT equal. A more equal test would be to output to a memory stream, vs. a file stream. I'll bet the results would be quite different.
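
A rough sketch of what that more equal test could look like in Java: the same buffered write loop pointed first at a memory stream and then at a file stream. StreamCompare and writeChunks are hypothetical names, and this is an assumption about how one might set the comparison up, not the paper's code. No particular timings are claimed; they will vary by machine and OS.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Illustrative only: StreamCompare and writeChunks are hypothetical names.
class StreamCompare {
    // Same loop and same buffering for both tests; only the sink differs.
    // Returns the elapsed time in nanoseconds.
    static long writeChunks(OutputStream sink, String chunk, int count) throws IOException {
        long start = System.nanoTime();
        try (Writer w = new BufferedWriter(new OutputStreamWriter(sink, StandardCharsets.US_ASCII))) {
            for (int i = 0; i < count; i++) {
                w.write(chunk);
            }
        } // close() flushes the writer; it does not force the OS to write to the device
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream mem = new ByteArrayOutputStream();
        long memNanos = writeChunks(mem, "1", 1_000_000);

        File tmp = File.createTempFile("streamcompare", ".txt");
        tmp.deleteOnExit();
        long fileNanos = writeChunks(new FileOutputStream(tmp), "1", 1_000_000);

        System.out.println("memory stream: " + memNanos / 1_000_000 + " ms, " + mem.size() + " bytes");
        System.out.println("file stream:   " + fileNanos / 1_000_000 + " ms, " + tmp.length() + " bytes");
    }
}
```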

How about a string of 9's? by jdavidb · 2015-03-25 06:38 · Score: 1

How long to create a string of 9's?

--
Secession is the right of all sentient beings.

This is why love C by aglider · 2015-03-25 06:39 · Score: 1

In C you always know where the pitfall is. Or at least you have chances to really know.

--
Sent as ripples into the electromagnetic field. No single photon has been harmed in the process.

Re:This is why love C by hcs_$reboot · 2015-03-25 21:01 · Score: 1

A bad algorithm in C may perform worse than a good algo in Perl. Fortunately the C programmers do usually know what they're doing.

--
Slashdot, fix the reply notifications... You won't get away with it...

Send it to Linus Torvalds! by Paul+Mallako · 2015-03-25 06:42 · Score: 1

Could someone please send this paper to Linus Torvalds? I'd like to hear his opinion on this paper. =)

Cherry-picked example by jtara · 2015-03-25 06:44 · Score: 1

See subject.

Re:This is the dumbest research I've seen this yea by c · 2015-03-25 06:49 · Score: 1

Their "writes to disk" are just being stored in disk cache hence the "faster" speed.

According to TFA, they actually do an explicit sync to disk at the end of the writes. So it's not purely writing into cache.

--
Log in or piss off.

Free performance boost by StikyPad · 2015-03-25 06:50 · Score: 1

I was about to upgrade my hardware, but instead I just pulled all my DIMMs and I'm only using virtual memory now. My computer is like a million times faster, and I think it even got rid of some viruses that were hiding in memory.

Now if I could just figure out why that goddamned System Idle Process is using so much CPU time!!!!!

--
https://www.eff.org/https-everywhere

So, this is how bad research can get... by ndykman · 2015-03-25 06:53 · Score: 1

Awful. So, in a language with immutable strings, building a string like so:

for (int i = 0; i < 1000000; i++) { str += "1"; }

is really slow, but if you use a file buffer like so:

for (int i =0; i < 1000000; i++) { fileBuffer.write("1"); }

it's much faster. Wow. No kidding. Also, note, they don't flush until the end. This is laughable. No wonder CS programs are under attack if this is the kind of thing that people think they can publish.

Re:So, this is how bad research can get... by SLOGEN · 2015-03-25 09:06 · Score: 1

They seem to be from a biology department, not CS :)

--
SLOGEN [ http://ungdomshus.nu : Sebastian cover music]
Re:So, this is how bad research can get... by ndykman · 2015-03-25 09:40 · Score: 1

Well, one author was listed as being from an Electrical and Computing Engineering department, and I would expect he or his peers would be able to see how pointless this is.

Stringbuilder? Perhaps..... by TiggertheMad · 2015-03-25 06:53 · Score: 2

Many people are suggesting using StringBuilder as an easy fix... If you think about this problem, that doesn't solve it as you approach infinite operations; it just pushes the cost crossover point way out (possibly beyond the limits of existing hardware, so it might be practically moot). Since they are doing silly comparisons like this, I would suggest just writing a linked list to store each byte as a counterexample that will provide more of an apples-to-apples comparison. Adding an element to a linked list will have a fixed cost, just like appending a byte to disk will, so after infinite operations, you could demonstrate that memory operations are always going to be faster performing similar tasks when the IO time of memory is faster than disk IO.
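
A small sketch of that linked-list counterexample, using java.util.LinkedList. ByteList and appendBytes are hypothetical names for illustration. Each addLast links one new node and never copies the existing elements, though every node carries its own memory overhead, so this is about fixed per-append cost rather than compactness.

```java
import java.util.LinkedList;

// Illustrative only: ByteList and appendBytes are hypothetical names.
class ByteList {
    // Append `count` copies of `value`, one node per byte.
    static LinkedList<Byte> appendBytes(byte value, int count) {
        LinkedList<Byte> list = new LinkedList<>();
        for (int i = 0; i < count; i++) {
            list.addLast(value); // O(1): adds a node at the tail, no existing data is moved
        }
        return list;
    }

    public static void main(String[] args) {
        LinkedList<Byte> list = appendBytes((byte) '1', 1_000_000);
        System.out.println(list.size()); // prints 1000000
    }
}
```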

--

HA! I just wasted some of your bandwidth with a frivolous sig!

Newsflash: Buffering works! by ggraham412 · 2015-03-25 06:55 · Score: 1

That's what buffered files are supposed to do: make slow disk writes appear as if they are as performant as memory writes.

Facepalm by lorinc · 2015-03-25 07:12 · Score: 1

This kind of useless paper is exactly why idiots should not be allowed in computer science. They even give the explanation in the paper and still draw the wrong conclusion. To me, it should be renamed "Bad programming habit performs worse than very bad performing habit in the absence of knowledge about the tool used".

--
Video of some good progressive thrash music

Re:This is the dumbest research I've seen this yea by Xyrus · 2015-03-25 07:14 · Score: 1

Agreed. This has got to be some sort of April Fools joke. There's no way this is a serious piece of research, much less one that would actually pass a peer review process. Even a junior level programmer could tell you how stupid this paper is.

Some tips for the authors of this travesty:

1. Learn how computers work.
2. Learn how operating systems work.
3. Learn how programming languages work, especially ones that are interpreted or VM/CLR based.
4. Learn2code.

This "research", if it actually is research, should be nominated for an Ig Nobel Prize. This would deserve an F even in an intro to programming course.

I think I'll go write a paper on how having lots of polygons in a 3D model will slow rendering down. I should get two Ph.Ds for that work.

--
~X~

Selected the wrong datatype = poor programmer by CraigCruden · 2015-03-25 07:34 · Score: 1

So they basically selected a bad datatype and wrote a very inefficient program to handle manipulation of data and they use that as the basis to say that memory was the issue. The issue was programming without thought to what the computer was actually doing. Is this what these two Universities are teaching their students? Were they being purposely bad programmers to prove a point?

God help the world if these people ever have to program efficiently....

How the mighty have fallen by ChaoticCoyote · 2015-03-25 07:38 · Score: 2

Slashdot has fallen far in credibility if it promotes sloppy research like the referenced article.

--
All about me

Re:How the mighty have fallen by hcs_$reboot · 2015-03-25 21:03 · Score: 1

Promotion for discussion, a flamebait article that brought ~500 posts AON. Not bad.

--
Slashdot, fix the reply notifications... You won't get away with it...
Re:How the mighty have fallen by houghi · 2015-03-25 22:05 · Score: 1

And the 4 minute rule doesn't help to have a discussion either. That is why you see 1000+ message discussions in the history, where we now hardly get to 500 on very hot topics.
Basically it means we are here just to click on the ads and shut up for the rest.

--
Don't fight for your country, if your country does not fight for you.

Re:This is the dumbest research I've seen this yea by falzer · 2015-03-25 08:01 · Score: 1

>According to TFA, they actually do an explicit sync to disk at the end of the writes. So it's not purely writing into cache.

The code in the paper says they flush before closing the file. This is not the same as a sync. They don't even flush (or sync) after each write.
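
To make the flush-versus-sync distinction concrete, here is a hedged Java sketch. FlushVsSync and writeAndSync are hypothetical names; the calls used (Writer.flush, FileOutputStream.getFD, FileDescriptor.sync) are standard JDK APIs. What the device itself does after a sync still depends on the OS and hardware.

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Illustrative only: FlushVsSync and writeAndSync are hypothetical names.
class FlushVsSync {
    static void writeAndSync(File file, String chunk, int count) throws IOException {
        try (FileOutputStream out = new FileOutputStream(file);
             Writer w = new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.US_ASCII))) {
            for (int i = 0; i < count; i++) {
                w.write(chunk);
            }
            w.flush();          // hands the Java-side buffers to the OS; the data may still sit in the page cache
            out.getFD().sync(); // asks the OS to force this file's buffers out to the storage device
        }
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("flushvssync", ".txt");
        tmp.deleteOnExit();
        writeAndSync(tmp, "1", 1_000_000);
        System.out.println(tmp.length()); // prints 1000000
    }
}
```

A benchmark that stops at flush() is timing the hand-off to the OS, not the device write.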

Reading helps by Kleanthes · 2015-03-25 08:08 · Score: 1

While it is nice to point out that the code sucks, they do already know that. They mention StringBuilder on page 6, for example. No problem there. But of course, in the end, the paper has nothing to do with disk vs. memory. It's about comparing an O(n^2) algorithm to an O(n) algorithm and determining that one of them is quicker. The problem is that it has nothing to do with where you save your data. The basic point, that bad code makes stuff slow, is true, but, if you ask me, told quite confusingly.

April 1st is a bit early this year! by SLOGEN · 2015-03-25 09:11 · Score: 1

Wow, April 1st came early this year,... although I can't spot any obvious prank names....

I sincerely hope this is a prank, even if it's apparently from a biology department.

--
SLOGEN [ http://ungdomshus.nu : Sebastian cover music]

Re:April 1st is a bit early this year! by hcs_$reboot · 2015-03-25 21:05 · Score: 1

Unfortunately that's serious "research". An article made from people who learned how a computer works through java and python. This is what you can expect.

--
Slashdot, fix the reply notifications... You won't get away with it...

FML by krkhan · 2015-03-25 09:12 · Score: 1

I opened the link and found that I share my first name with the first author. How am I supposed to live with this now?

Re:This is the dumbest research I've seen this yea by brausch · 2015-03-25 09:14 · Score: 1

Even worse, they were not just concatenating the strings in memory. They were making a new string each time and copying the old one first, then concatenating. Their choices of computer languages and their lack of understanding of those choices makes this a problem.

--
"Almost every wise saying has an opposite one, no less wise, to balance it." - George Santayana

Junk science by compudj · 2015-03-25 09:25 · Score: 1

Wow. Has anyone heard about buffered writes? And does kernel-level page cache ring a bell? No fsync was ever used in the benchmarks; therefore, it is never actually hitting the disk. The only good thing about this paper is that the Java and Python listings are available at the end, for everyone to identify the basic flaws in this research.

So yeah, it's faster to write directly to MEMORY than to do a copy before writing to MEMORY.

Re:This is the dumbest research I've seen this yea by c · 2015-03-25 09:30 · Score: 1

Well, that's pretty lame then. They did say "sync" in the paper, but I didn't get to the actual code since, quite frankly, I was being blinded by the daylight already coming through the holes in the rest of the paper.

--
Log in or piss off.

Fixed by Hardhead_7 · 2015-03-25 10:51 · Score: 1

http://pastebin.com/wJuWeAiN

In-Memory takes .1 seconds.
Writing to Disk takes .4 seconds.

No programming knowledge required to debunk this by wendyo · 2015-03-25 10:53 · Score: 1

Just look at their chart, they are comparing apples to oranges. Their "memory test" is shown as string concatenation time plus write to disk time [they break it into two columns]. Disk test is write to disk time.

This is like saying it takes longer to walk to the store if you make a pair of shoes first than if you just walk to the store.

java string concatenation is O(N) by bingoUV · 2015-03-25 11:21 · Score: 1

This "researcher" is an idiot. The java code given at the bottom of the "research paper" uses + operator to concatenate strings. This is O(N) in Oracle java. Total algorithm becomes O(N*N) in memory, and O(N) on disk.

Obviously N*N takes longer than N after a certain N even when N*N is running on faster memory.

--
Bingo Dictionary - Pragmatist, n. A myopic idealist.

More mod commentary required? by uarch · 2015-03-25 12:45 · Score: 1

I'm wondering if we need to find a means of enforcing at least some level of fact checking and commentary from the mods.
Simply re-posting this submission as is has turned into a giant flame fest because the research was crap. (As is a frighteningly large proportion of comp-sci & comp-eng research these days).

If the mods are decent then they should be able to take a moment to look through this, understand how much of a train-wreck it is and provide a bit of commentary to prevent the flame-fest or outright reject it. (I'm already on the site so a crap article isn't going to bring any additional ad revenue. It will simply drive traffic elsewhere after the first few people identify it as crap.)

Yes, yes. I know. That isn't what our mods do...

Maybe it's time they started...

Not completely useless... by Lodragandraoidh · 2015-03-25 18:29 · Score: 1

They should have viewed this presentation about making a Python data-crunching application 114,000 times faster before they set off on their research project.

To summarize - there are a multitude of ways to optimize your application including using the chip's onboard cache to avoid the overhead/delay of accessing memory on the motherboard across the bus

Yes - as we try to eke out more performance from our applications - we'll need to consider the relationship between our applications and the underlying implementation and capabilities of the hardware it lives on. Further - I would say we also should be considering how to make our tools do this sort of thing for us. Given the complexities we are seeing in the development arena today, including virtualization, the need to do more with less both on the back end, as well as on small hand held devices, and the need to build more faster while increasing security of what we build, I consider it imperative.

--

Lodragan Draoidh
The more you explain it, the more I don't understand it. - Mark Twain

Has anybody here actually read the paper? by Anonymous Coward · 2015-03-25 19:00 · Score: 1

Based on the comments here, I doubt if many people actually read the paper with any amount of attention before joining the mob with pitchforks.

The authors clearly know Java Strings are not the best thing for concatenation. They even mention StringBuilder as a way to fix the performance problem. This is an example of how things can go wrong if not careful. It is great that so many slashdotters already know their Java Strings so well, but does everybody know their Python, Haskell, etc., too? The paper is making a general point using a specific example.

A lot of posters mention the problem being comparing an O(N^2) algorithm with an O(N) algorithm. If you read the paper you'll notice that the string concatenation loop they use looks like this:
for (int i = 0; i < numIter; i++) {
    concatString += addString;
}
The running times don't change in a quadratic way, because in their code numIter * length(addString) is always 1,000,000. As the loop count gets bigger, the appended string gets smaller.

The difference in running times comes from the number of string concatenations (the outer loop), and it is clear that as the number of concatenations drops, performance gets better. This indicates that the problem is caused by the immutable String objects, which need to be reallocated and re-initialized. The more this needs to be done, the poorer the performance, as pointed out in the paper.
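
A back-of-the-envelope model of that effect, as a Java sketch. It assumes, as a simplification, that each concatenation copies every character of the newly built string exactly once; real JVMs may copy more than once per concatenation. CopyModel and charsCopied are hypothetical names. With a fixed total T = numIter * len, the copies add up to T * (numIter + 1) / 2, so the cost grows with the number of concatenations even though the final string is always the same size.

```java
// Illustrative only: CopyModel and charsCopied are hypothetical names.
class CopyModel {
    // Characters copied by repeating `s += chunk` numIter times, under the
    // simplifying assumption that each step copies the whole new string once.
    static long charsCopied(int numIter, int chunkLen) {
        long total = 0;
        long length = 0;
        for (int i = 0; i < numIter; i++) {
            length += chunkLen; // length of the freshly allocated result
            total += length;    // every character of it is copied into place
        }
        return total;
    }

    public static void main(String[] args) {
        // Each pair multiplies out to a final string of 1,000,000 characters.
        System.out.println(charsCopied(1_000_000, 1)); // prints 500000500000
        System.out.println(charsCopied(1_000, 1_000)); // prints 500500000
        System.out.println(charsCopied(1, 1_000_000)); // prints 1000000
    }
}
```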

Speaking of complexity, does anyone know what is the complexity of their disk-only code below?
for (int i = 0; i < numIter; i++) {
    writer.write(addString);
}
The for loop goes over N items, while write() must loop over the length of addString == O(N^2)? In any case, here again numIter * length(addString) is always 1,000,000.

I believe the paper is simulating creating a file in memory and then writing it to disk as opposed to writing the strings to disk as soon as the strings are generated. The alternative/clever ways of generating a string of 1,000,000 "1"s like concatString = addString * numIter are useless in this case, because in reality addString may contain unique data bits and pieces.

There are some unexpected results with the paper's Python tests. The considerable performance differences under Windows and Linux with the same code, or when rearranging a concatenation order are interesting. This paper is actually worth a careful read, even if most of us would never write code that heavily uses Java Strings concatenation!

Re:Has anybody here actually read the paper? by lu-darp · 2015-03-25 20:59 · Score: 1

Let the pitchforks fly!! A paper has to present something non-obvious and of value, this fails on both counts.
I'm just waiting for the next article "Newsflash: solving a sudoku puzzle by hand is faster than software - under certain conditions"

Java strings overweight? Citation? by kervin · 2015-03-25 22:40 · Score: 1

People just say things on here and it's taken as fact. How is Java's String implementation overweight?

Java, like C#, does use 16 bit char widths, but that actually makes it faster than variable width characters. That's why these languages do so.

So what about Java strings is 'overweight'?

Re:Java strings overweight? Citation? by Spazmania · 2015-03-27 15:37 · Score: 1

Compared to how the C-based kernel handles its byte buffers after you've written to the file handle? You really need help understanding how java strings are overweight _by comparison_?

--
Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.

If you do it wrong in memory by fluffynuts · 2015-03-26 05:59 · Score: 1

Expect the disk to be faster. For the privileged few who bothered to RTFA, you'll understand. The high-level languages used tend to degrade on the specific instructions (string concat) as the number of ops increases; I also wager that this problem is more prevalent on Windows where the memory manager is about as useful as a clown who thinks he has an eye for fashion. So, distilled, this article should read:

"Researchers find an obtuse way to defy a well-established rule-of-thumb".

Bravo. Or not.

this is NOT news by rewindustry · 2015-03-26 06:23 · Score: 1

the tests they show only serve to demonstrate how slow python string concatenation is.
they then go on to say that disk writes are buffered more sensibly, and this is why they are faster.
the key thing to notice is that disk buffering happens in **memory**.

memory is still, and will remain, much faster.

you just have to use it properly.

I'll say it - Junk science! by ebvwfbw · 2015-03-26 11:45 · Score: 1

Put this paper in the dictionary under junk science. This is stupid.

Send them to Mars; by NewYork · 2015-03-26 15:51 · Score: 1

Send these researchers to Mars; They're very much needed there;

--
Casteism

ArrayList and StringBuilder use this by tepples · 2015-03-27 03:55 · Score: 1

I would realloc the buffer doubling the size each time it overflowed. This allocation strategy is simple, is bounded to 50% worst case overhead, and requires only log N reallocations for a maximum buffer size of N.

It also happens to be the policy used by Java's ArrayList and presumably by its StringBuilder.
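
A minimal sketch of the doubling strategy described above, counting how many reallocations it takes. DoublingBuffer is a hypothetical class written for illustration, not a copy of any JDK internals; the exact growth policies of the JDK's own classes differ by class and version.

```java
import java.util.Arrays;

// Illustrative only: DoublingBuffer is a hypothetical class, not JDK code.
class DoublingBuffer {
    private char[] data = new char[1];
    private int size = 0;
    private int reallocations = 0;

    void append(char c) {
        if (size == data.length) {
            // Buffer is full: allocate twice the capacity and copy the old contents over.
            data = Arrays.copyOf(data, data.length * 2);
            reallocations++;
        }
        data[size++] = c;
    }

    int size() { return size; }
    int capacity() { return data.length; }
    int reallocations() { return reallocations; }

    public static void main(String[] args) {
        DoublingBuffer buf = new DoublingBuffer();
        for (int i = 0; i < 1_000_000; i++) {
            buf.append('1');
        }
        System.out.println(buf.reallocations()); // prints 20
        System.out.println(buf.capacity());      // prints 1048576
    }
}
```

Starting from a capacity of 1, a million appends trigger only 20 reallocations, which is the log N behaviour mentioned above.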

Reminds me of calculating factorial with .bat file by renergy · 2015-03-27 09:33 · Score: 1

Reminds me of one really strange course in college. The lecturer calculated factorials, up to ten, using a DOS 6.22 .bat file. Actually, he provided three "solutions".
One solution was to write ten "if" cases and just echo the corresponding number, hardcoded.
I can't remember the second solution.
The third was the real "beast". It was based on recursion: "factor.bat" called itself. The batch created a one-byte file at the beginning, and this file was joined n times within each iteration. All this to facilitate multiplication, which was not directly achievable in a batch file. At the end, there was "dir /b fact.txt" and probably an echo with "look at the size, this is the result".
I kid you not. It was something like: for $1==1, fact.bat created a file "fact.txt" with one byte (using echo x > fact.txt); then the file was joined n times with type fact.txt >> xfact.txt; after fact.txt had been added n times to xfact, xfact would be renamed to fact.txt.
Of course, in this case, disk access really was slower than the Pascal version, which ran in memory... :)
