No, It's Not Always Quicker To Do Things In Memory
itwbennett writes: It's a commonly held belief among software developers that avoiding disk access in favor of doing as much work as possible in memory will result in shorter runtimes. To test this assumption, researchers from the University of Calgary and the University of British Columbia compared the efficiency of alternative ways to create a 1MB string and write it to disk. They consistently found that doing most of the work in memory to minimize disk access was significantly slower than just writing out to disk repeatedly (PDF).
TL;DR:
They used Python and Java. It's hard to develop a meaningful thesis about general programming when you're that far up the abstraction stack. Who knows, maybe Python and Java suck at memory management (GASP).
It's not even the choice of tools; they seem to willfully misuse the languages to get poor results.
If it's in your sig, it's in your post.
The language is not the problem, the code is terrible. They did String concatenation in the most expensive way possible. I'm pretty sure if you used a pre-sized StringBuilder it would be faster in memory.
They also make some very novice benchmarking mistakes.
This is actually a pretty good interview problem. Anyone who writes code like that should not be hired, even for a junior position.
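To illustrate the concatenation point above, here is a minimal sketch, not the researchers' code. The class name ConcatSketch, the method names, and the size used in main are all made up for illustration. It assumes the slow path is repeated += on a String, which copies everything built so far on every iteration, and compares that with a StringBuilder whose capacity is set up front.

```java
public class ConcatSketch {
    // Repeated String concatenation: each += allocates a new String and
    // copies the whole contents so far, so n appends cost O(n^2) copies.
    static String buildByConcat(int n) {
        String s = "";
        for (int i = 0; i < n; i++) {
            s += "1";
        }
        return s;
    }

    // Pre-sized StringBuilder: the buffer is allocated once with room for
    // n characters, so each append just writes into existing space.
    static String buildWithBuilder(int n) {
        StringBuilder sb = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            sb.append('1');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int n = 50_000; // hypothetical size, kept small so the slow path finishes quickly

        long t0 = System.nanoTime();
        String a = buildByConcat(n);
        long t1 = System.nanoTime();
        String b = buildWithBuilder(n);
        long t2 = System.nanoTime();

        System.out.println("same contents: " + a.equals(b));
        System.out.printf("concat:  %.3f ms%n", (t1 - t0) / 1e6);
        System.out.printf("builder: %.3f ms%n", (t2 - t1) / 1e6);
    }
}
```

A single timed run like this is itself a novice benchmark (no JIT warm-up, no repetition), so treat the printed numbers as a rough indication only.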
Fixed their code by using a StringBuilder and moving the flush call inside the loop, so it actually writes it to disk.
The result:
In-memory mean: string time 0.008900000000000002
In-memory mean: file time 0.0034000000000000002
Disk-only mean: file time 1.1747
Yes, it's still quicker to do things in memory, you just have to do it right.
PS: with just one flush:
In-memory mean: string time 0.0091
In-memory mean: file time 0.0038000000000000004
Disk-only mean: file time 0.026599999999999995
Still faster in memory.
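The two variants described above (flush inside the loop versus a single flush) might look roughly like the sketch below. This is not the poster's actual code; the class name FlushSketch, the writeChunks helper, the chunk contents, and the chunk count are assumptions made for illustration. Note that flush() on a BufferedWriter only hands the data to the operating system; it does not force it onto the physical disk.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class FlushSketch {
    // Writes `chunks` copies of `chunk` to `file` and returns elapsed nanoseconds.
    // With flushEachWrite = true the buffer is pushed to the OS after every write;
    // otherwise it is flushed only as it fills and once when the writer is closed.
    static long writeChunks(Path file, int chunks, String chunk, boolean flushEachWrite)
            throws IOException {
        long start = System.nanoTime();
        try (BufferedWriter out = Files.newBufferedWriter(file)) {
            for (int i = 0; i < chunks; i++) {
                out.write(chunk);
                if (flushEachWrite) {
                    out.flush();
                }
            }
        } // closing the writer flushes whatever is still buffered
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("flush-sketch", ".txt");
        try {
            String chunk = "1234567890"; // hypothetical 10-character chunk
            int chunks = 100_000;        // roughly 1 MB in total

            long perWrite = writeChunks(file, chunks, chunk, true);
            long single = writeChunks(file, chunks, chunk, false);

            System.out.printf("flush per write: %.3f ms%n", perWrite / 1e6);
            System.out.printf("flush at close:  %.3f ms%n", single / 1e6);
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Both runs produce the same file; only the number of trips from the Java buffer to the OS differs, which is the cost the per-write flush adds.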
That's the very first thing I thought of... what if the code were written in a lower-level language (and not in fucking python or Java!), then made do this task on Windows $latest, OSX $latest, Linux $latest, maybe a resurrected DOS $latest for reference, etc... I mean, it can't be that hard to write this thing in C and port it as needed.
Doesn't seem very scientific at all otherwise. I mean, are they testing memory versus disk, are they testing memory vs. disk performance in a given specific language, or what? Maybe they just needed to flesh out their abstract a bit more to reflect this?
Quo usque tandem abutere, Nimbus, patientia nostra?