Memory Management Technique Speeds Apps By 20%
Dotnaught writes "A paper (PDF) to be presented later this month at the IEEE International Parallel and Distributed Processing Symposium in Atlanta describes a new approach to memory management that allows software applications to run up to 20% faster on multicore processors. Yan Solihin, associate professor of electrical and computer engineering at NCSU and co-author of the paper, says that using the technique is just a matter of linking to a library in a program that makes heavy use of memory allocation. The technique could be especially valuable for programs that are difficult to parallelize, such as word processors and Web browsers." Informationweek has a few more details from an interview with Solihin.
Indeed. This "technique" appears to be nothing more than replacing a poorly-locked malloc() implementation with a good malloc() implementation that has better locks and (probably) does some work speculatively.
With a proper malloc() implementation, locks are NOT highly contended and a thread doing malloc() does not lock out other threads for a long period of time. In theory, the overhead of managing the queueing / signalling to offload work to a malloc()-thread should be higher than the overhead of doing a proper malloc() in the first place - if its not, then the original malloc() implementation is sub-optimal. Modern malloc() implementations use slabs, thread-local caches, and other tricks to avoid slow paths - they really aren't that inefficient, there isn't "20% CPU time" left to squeeze out unless your baseline is a non-optimal malloc() in the first place. Which leads me to conclude that they are doing speculative execution: use an extra thread to pre-expand caches, pre-fault pages, pre-grow the heap segment, and burn a bunch of CPU time on another thread to do it. Speculative execution is, after all, the sexy research area nowadays (especially for some EE researchers who like to publish "Foo runs 20% faster!" papers while ignoring the larger systemic slowdown they induced) - speculative execution only works when hardware is idle, and in the current climate of low-power computing, it's cheaper to turn off idle hardware than use it for speculative execution.
And we don't see the trade-off. A technique isn't very good if it burns 40% more CPU time (and thus 40% more battery life) to get a 20% speedup, and I think they are more likely to solve P=NP than to have actually made malloc() take less total work ... which is why I'm so convinced this is just speculative execution, the only way to do less work is to guess what that work was beforehand and burn another thread doing it.
Now, maybe the paper is more restrained in its claims and it's just the journalist hyping this up. But if this is the hyped-up work coming out of a CS department, I wouldn't want to go there...
A witty [sig] proves nothing. --Voltaire
That's already implemented in the high performance ssh patch, available here. Scroll down to the "Threaded CTR cipher mode" patch, if that's the only part you're interested in.
I've applied it to the openssh package on my Fedora 12 system. As is, it provides about 40% increased throughput on my quad-core. You may be able to get more by tweaking it to increase the thread count.
Can You Say Linux? I Knew That You Could.
Now digesting the real paper at http://www.ece.ncsu.edu/arpers/Papers/MMT_IPDPS10.pdf, they do do a trick of making free() asynchronous to avoid blocking there, but they also do a kind of client-server thing, with a nontrivial but fast and dumb malloc client in the main thread.
Not bad. They really tried a lot of different stuff, thought some stuff out carefully. This reviewer approves!
I'm saying that 20% overhead for dynamic memory management is typical of even well-designed programs. Very few programs can take good advantage of efficient bulk-deallocating arenas/regions, and research has shown custom memory pooling schemes are generally no better than malloc/free.
Higher Logics: where programming meets science.
The type of system you're talking about is NUMA (Non-Uniform Memory Architecture), and yes, any OS worth its salt has supported it automagically for years. I think even Windows advertises support for NUMA now, whether it works is another question.
Game! - Where the stick is mightier than the sword!
They block the thread (by spinning, not sleeping) that calls malloc() while the allocation request is serviced by the server thread. The server thread is not busy looping it is signalled when a request is issued. The combination of pre-allocation and the lockless server protocol means that the probability of a thread needing to be blocked in the first place is very low, and if it is the lock will be held for a very short time, And for short periods of time spinning is more efficient than the whole signal/goto sleep/wakeup dance.
It's not a cute trick, its a way of reducing latency (used in thousands of places in fast code paths in most operating systems), and your claims about CPU utilisation is mostly false. It's true that it incurs some penalty for the worst case scenario.
Also the paper says "...especially for sequential applications which cannot easily benefit from the multicore architecture otherwise"
In sequential apps the overhead of the spin locks is much less anyway, because there is less internal concurrency in the process.